December 25, 2015

Day 25 - Laziest Christmas Ever

Written by: Sally Lehman (@sllylhmn) & Roman Fuentes (@romfuen)
Photography by: Brandon Lehman (@aperturetwenty)
Edited by: Bill Weiss (@BillWeiss)

Equipment List

This Christmas, we (Roman and Sally) decided to use a pair of thermocouples, a Raspberry Pi, Graphite, and Nagios to have our Christmas ham call and email us when it was done cooking (and by “done” we mean an internal temperature of 140°F). The setup also allowed us to remotely monitor the oven and ham temperatures from a phone or laptop using Graphite dashboards. Since we are both busy with life and family obligations around the holidays, finding new ways to automate food preparation was considered a necessity.

Temperature Sensor Setup

The Raspberry Pi was connected to a pair of high-temp, waterproof temperature sensors by following this tutorial from Adafruit. We deviated from the tutorial by adding a second temperature sensor, since we wanted one for the ham and a separate one for the oven. Attaching the additional sensor required soldering together the two 3.3V voltage leads, the ground leads, and the data lines. The data and voltage lines were then bridged using a 4.7kΩ pull-up resistor.

Prototype with data output

We used some electrical tape to hide the nominal soldering job. The tape also helped keep the connection points for the pins in place. We wrapped each soldered lead pair individually and then to each other, completing the assembly by attaching this package to the pinout.

Prototype measuring whiskey temperature

The sensors shared the same data line feeding into the Pi and conveniently showed up in Linux as separate device folders.

Device Folders.

The Adafruit tutorial included a Python script that would read data from the device files and output the temperature in Celsius and Fahrenheit. The script did most of what we needed, so we made some minor modifications and ran two copies in parallel: one to measure the internal temperature of the ham and the other for the oven [1]. A follow-up project would be to combine these into a single script.
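
To give a sense of what the reading side looks like, here is a minimal sketch in the spirit of the Adafruit script, assuming 1-Wire sensors that show up under /sys/bus/w1/devices; the device ID pattern and paths are assumptions and may differ on your Pi:

import glob
import time

def read_temp_f(device_dir):
    """Return one reading in Fahrenheit from a sensor's w1_slave file."""
    with open(device_dir + '/w1_slave') as f:
        lines = f.readlines()
    # The first line ends in YES when the CRC check passed; retry until it does.
    while not lines[0].strip().endswith('YES'):
        time.sleep(0.2)
        with open(device_dir + '/w1_slave') as f:
            lines = f.readlines()
    # The second line carries the reading in thousandths of a degree C, e.g. t=23125
    temp_c = int(lines[1].split('t=')[1]) / 1000.0
    return temp_c * 9.0 / 5.0 + 32.0

if __name__ == '__main__':
    # Each attached sensor appears as its own directory, e.g. 28-000005e2fdc3
    for device in glob.glob('/sys/bus/w1/devices/28-*'):
        print(device, read_temp_f(device))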

The troubling case of network access

We planned to set the Pi next to the oven, which is not an ideal place for Ethernet cabling. Wireless networking was an obvious solution; however, the setup and configuration were not trivial and took longer than expected, because the adapter would work wonderfully and then refuse to connect to anything. Our combined investigative powers and Linux sleuthing led to the solution of repeatedly ‘turning it off and back on again’. Here is the config that worked for us using Sally’s iPhone hotspot [2]. Sadly, we lost a few data points when we moved the iPhone to another room and lost network connectivity, rudely terminating the Python script. In hindsight, using the local wireless access point would have prevented this, but we were happy to have any functional network connection at that point.

The plan

The workflow of our Christmas ham monitoring and alerting system was as follows. The Python scripts would grab temperature readings (in Fahrenheit) every second, reformat the output to match what Graphite expects, and send it to our DigitalOcean droplet over a TCP socket. The metrics were named christmaspi.temperature.ham and christmaspi.temperature.oven. Nagios would poll Graphite over HTTP for both of these metrics every few minutes and would send PagerDuty notifications for any warning- or critical-level breach.
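
The export side is only a few lines. Here is a minimal sketch assuming Graphite's plaintext protocol on Carbon's default port 2003; the droplet hostname and the reading function are placeholders for the real pieces:

import socket
import time

GRAPHITE_HOST = 'graphite.example.com'  # placeholder for the droplet's address
GRAPHITE_PORT = 2003                    # Carbon's default plaintext listener

def send_metric(name, value):
    """Send one 'metric value timestamp' line to Graphite over TCP."""
    line = '%s %f %d\n' % (name, value, int(time.time()))
    sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5)
    try:
        sock.sendall(line.encode('ascii'))
    finally:
        sock.close()

def read_ham_temp_f():
    return 72.0  # placeholder for the sensor-reading code shown earlier

while True:
    send_metric('christmaspi.temperature.ham', read_ham_temp_f())
    time.sleep(1)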

We first decided to run the complete suite of X.org, Apache, Nagios, Graphite, and the Python polling scripts on our Pi host ‘christmas-pi’. Installing Nagios 3 and launching the web interface was straightforward. The Graphite install, however, overwhelmed the ~3 GB SD card. We freed up some space by removing the wolfram-engine package from the default Raspbian install.

All I want for Christmas is Docker containers running Graphite

At this point, with all our services running, we were left with around 272 MB of free RAM. Surprisingly, instead of catching fire, the Pi quite capably displayed the Nagios and Graphite web interfaces! Each Python script was running in an infinite loop and exporting data. Two thumbs up.

Imagine our astonishment when we attempted to take a peek at our Graphite graphs and saw only broken image placeholders! Google says we may have been missing a Python package or two. In our darkest hour, we turned to the cloud. A simple IP change would point the Python script to send data via a TCP/IP socket to any Graphite cluster we had. There was also a nice Docker Compose setup that would automagically create a complete Graphite cluster given the command “docker-compose up -d” and a DigitalOcean droplet running Docker. Following the quick setup of the cluster, we were ready to begin recording data and getting nice graphs out of it.

Alerting and Notifications

At this point, the remaining work was to set up Nagios and Graphite to talk to each other, and then to find a way for Nagios to alert us. To handle the call and email page-outs, we signed up for a 14-day trial of PagerDuty and followed their Perl script integration guide to configure it with Nagios.

Jason Dixon had created a nice Nagios poller for Graphite that we also made use of. Once the script was set as executable, and following an IP change to point our data export at the DigitalOcean droplet, we added its command definition to the default Nagios command file. We also changed the default Nagios service configuration to set the following limits:

  • Send a warning notification if oven temperature is below 320 °F or ham temperature is above 135 °F.
  • Send a critical notification if oven temperature is below 315 °F or ham temperature is above 140 °F.

Additionally, we modified the default Nagios hostname configuration so our host was called christmas-pi.

We were now ready to turn on the gas (or electricity) and start cooking.

Robot Chef

Alas, the oven temperature stopped increasing at 260 °F, as our Graphite graphs below show. We looked at the temperature rating for the probe and... yep, 260 °F was the limit. A follow-up project would be to locate and integrate a temperature sensor that can function at oven air temperatures.

Christmas Ham Graphite Dashboard - 4 Hours

The recommended cook time for ham is “15-18 minutes per pound”, so we estimated our 16-pound ham would need around 4 hours to fully cook. You will see in our Graphite graphs that the ham rose in temperature too quickly, alerting us long before it was legitimately done. So, we did some troubleshooting and found the reading was about 50 degrees lower with the sensor placed about 3 inches deeper. We reseated the temperature probe and went back to not being in the kitchen.

Nagios warned when the ham reached a temperature of 135°F.

christmas-pi warning

The christmas-pi sent us text messages, emails, and a phone call to let us know that the warning alert triggered. A few minutes later, christmas-pi sent a critical alert that the ham had reached an internal temperature of 140°F.

christmas-pi critical

Here is another view of these alerts. Great success!

Photo of ham, dressed to impress, in delicious pork glory

We hope you enjoyed reading about our project, and wish you and your family a happy holiday!

References

[1] Using the Python socket library to export data to Graphite
[2] If that config file looks like gibberish to you, take a moment to learn Puppet or Chef

December 24, 2015

Day 24 - It's not Production without an audit trail

Written by: Brian Henerey (@bhenerey)
Edited by: AJ Bourg (@ajbourg)

Don't be scared

I suspect when most tech people hear the word audit they want to run away in horror. It tends to bring to mind bureaucracy, paperwork creation, and box ticking. There's no technical work involved, so it tends to feel like a distraction from the 'real work' we're already struggling to keep up with.

My mind shifted on this a while ago, when I worked very closely with an auditor over several months helping put together Controls, Policies and Procedures at an organization to prepare for a SOC2 audit. If you're not familiar with a SOC2, in essence it is a process where you define how you're going to protect your organization's Security, Availability, and Confidentiality (1) in a way that produces evidence for an outside auditor to inspect. The end result is a report you can share with customers, partners, or even board members, with the auditor's opinion on how well you're performing against what you said.

But even without seeking a report, aren't these all good things? As engineers working with complex systems, we constantly think about Security and Availability. We work hard implementing availability primitives such as Redundancy, Load Balancing, Clustering, Replication and Monitoring. We constantly strive to improve our security with DDOS protection, Web Application Firewalls, Intrusion Detection Systems, Pen tests, etc. People love this type of work because there's a never-ending set of problems to solve, and who doesn't love solving problems?

So why does Governance frighten us so? I think it's because we still treat it like a waterfall project, with all the audit work saved until the end. But what if we applied some Agile or Lean thinking to it?

Perhaps if we rub some Devops on it, it won't be so loathsome any more.

Metrics and Laurie's Law

We've been through this before. Does anyone save up monitoring to the end of a project any longer? No, of course not. We're building new infrastructure and shipping code all the time, and as we do, everything has monitoring in place as it goes out the door.

In 2011, Laurie Denness coined the phrase "If it moves, graph it". What this means to me is that any work done by me or my team is not "Done" until we have metrics flowing. Generally we'll have a dashboard as well, grouping as appropriate. However, I've worked with several different teams at a handful of companies, and people generally do not go far enough without some prompting. They might have os-level metrics, or even some application metrics, but they don't instrument all the things. Here are some examples that I see commonly neglected:

  • Cron jobs / background tasks - How often do they fail? How long do they take to run? Is it consistent? What influences the variance? (A minimal wrapper sketch follows this list.)

  • Deployments - How long did it take? How long did each individual step take? How often are deploys rolled back?

  • Operational "Meta-metrics" - How often do things change? How long do incidents last? How many users are affected? How quickly do we identify issues? How quickly from identification can we solve issues?

  • Data backups / ETL processes - Are we monitoring that they are running? How long do they take? How long do restores take? How often do these processes fail?
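
As a small illustration of the cron-job case above, a job can be wrapped so that every run reports its duration and failures. This is a rough sketch; the statsd address, metric names, and wrapper script name are made up:

# cronwrap.py - run a command, report its duration and failures to statsd.
import os
import socket
import subprocess
import sys
import time

STATSD_ADDR = ('statsd.example.com', 8125)  # placeholder address

def statsd_send(line):
    """Fire-and-forget one raw statsd datagram, e.g. 'cron.backup.time:5300|ms'."""
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(line.encode('ascii'), STATSD_ADDR)

def main(command):
    name = os.path.basename(command[0])
    start = time.time()
    rc = subprocess.call(command)
    elapsed_ms = int((time.time() - start) * 1000)
    statsd_send('cron.%s.time:%d|ms' % (name, elapsed_ms))   # how long did it take?
    if rc != 0:
        statsd_send('cron.%s.failures:1|c' % name)           # how often does it fail?
    return rc

if __name__ == '__main__':
    # Usage in the crontab: cronwrap.py /usr/local/bin/nightly-backup.sh
    sys.exit(main(sys.argv[1:]))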

Now let's take these lessons we've learned about monitoring and apply them to audits.

Designing for Auditability

There's a saying that goes something like 'Systems well designed to be operated are easy to operate'. I think designing a system to be easily audited will have the same effect. So if you've already embraced 'measure all things!', then 'audit all the things!' should come easily to you. You can do this by having these standards:

  • Every tool or script you run should create a log event.
  • The log event should include as much meta-data as possible, but start with who, what and when.
  • These log events should be centralized into something like Logstash.
  • Adopt JSON as your logging format.
  • Incrementally improve things over time. This is not a big project to take on.

While I wrote this article, James Turnbull published a fantastic piece on Structured Logging.
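
To make that concrete, here is a small sketch of a script emitting one JSON event with the who, what, and when; the field names are only suggestions, not a required schema:

import getpass
import json
import socket
import sys
from datetime import datetime

def audit_event(action, **extra):
    """Write a single structured log line for Logstash (or similar) to pick up."""
    event = {
        '@timestamp': datetime.utcnow().isoformat() + 'Z',  # when
        'user': getpass.getuser(),                          # who
        'host': socket.gethostname(),
        'action': action,                                   # what
    }
    event.update(extra)
    sys.stdout.write(json.dumps(event) + '\n')

# Hypothetical example: record a deploy with some extra metadata.
audit_event('deploy', app='billing-api', version='1.4.2', environment='staging')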

Start small

The lowest hanging fruit comes from just centralizing your logs and using a tool like Logstash. Your configuration management and/or deployment changes are probably already being logged in /var/log/syslog.

The next step is to be a bit more purposeful and instrument your most heavily used tools.

Currently, at the beginning of every Ansible run we run this:

pre_tasks:
  - name: ansible start debug message #For audit trail
    shell: 'sudo echo Ansible playbook started on {{ inventory_hostname }} '

and also run this at the end:

post_tasks:
  - name: ansible finish debug message #For audit trail
    shell: 'sudo echo Ansible playbook finished on {{ inventory_hostname }} '

Running that command with sudo privileges ensures it will show up in /var/log/auth.log.

Improve as you go

In the first few years after Statsd first came out, I evangelized often to get Dev teams to start instrumenting their code. Commonly, people would think of this as an extra task to be done outside of meeting the acceptance criteria of whatever story a Product Manager had fed to them. As such, this work tended to be put off until later, perhaps when we hoped we'd be less busy (hah!). Don't fall into this habit! Rather, add purposeful, quality logging to every bit of your work.

Back then, I asked a pretty senior engineer from an outside startup to give a demo of how he leveraged Statsd and Graphite at his company, and it was very well received. I asked him what additional amount of effort it added to any coding he did, and his answer was less than 1%.

The lesson here is not to think of this as a big project to go and do across your infrastructure and tooling. Just begin now, improve whatever parts of your infrastructure code-base you're working in, and your incremental improvements will add up over time.

CloudTrail!

If you're working in AWS, you'd be silly not to leverage CloudTrail. Launched in November 2013, AWS CloudTrail "records API calls made on your account and delivers log files to your Amazon S3 bucket."

One of the most powerful uses for this has been tracking all Security Group changes.

Pulling your CloudTrail logs into Elasticsearch/Logstash/Kibana adds even more power. Here's a graph plus event stream of a security rule being updated that opens up a port to 0.0.0.0/0. Unless this rule is in front of a public-internet-facing service, it is the equivalent of chmod 0777 on a file/directory when you're trying to solve a permissions problem.

It can occasionally be useful to open things to the world when debugging, but too often this change is left behind in a sloppy way and poses a security risk.
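
As a rough illustration of the kind of query behind that graph, here is a sketch that asks Elasticsearch for recent CloudTrail events opening a security group to the world; the host and logstash-* index pattern are assumptions:

import json
import requests

ES_URL = 'http://my.elasticsearch:9200/logstash-*/_search'  # placeholder host

query = {
    'size': 20,
    'sort': [{'@timestamp': {'order': 'desc'}}],
    'query': {
        'query_string': {
            'query': 'eventName:AuthorizeSecurityGroupIngress AND "0.0.0.0/0"'
        }
    },
}

resp = requests.post(ES_URL, data=json.dumps(query),
                     headers={'Content-Type': 'application/json'})
for hit in resp.json()['hits']['hits']:
    source = hit['_source']
    print(source.get('@timestamp'), source.get('userIdentity', {}).get('arn'))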

Auditing in real-time!

Audit processes are not usually a part of technical workers' day-to-day activities. Keeping the compliance folks happy doesn't feel central to the work we're normally getting paid to do. However, if we think of the audit work as a key component of protecting our security or availability, perhaps we should be approaching it differently. For example, if the audit process is designed to keep unwanted security holes out of our infrastructure, shouldn't we be checking this all the time, not just in an annual audit? Can we get immediate feedback on the changes we make? Yes, we can.

Alerting on Elasticsearch data is an incredibly powerful way of getting immediate feedback on deviations from your policies. Elastic.co has a paid product for this called Watcher. I've not used it, preferring to use a Sensu plugin instead.

{
  "checks": {
    "es-query-count-cloudtrail": {
      "command": "/etc/sensu/plugins/check-es-query-count.rb -h my.elasticsearch  -d 'logstash-%Y.%m.%d' --minutes-previous 30 -p 9200 -c 1 -w 100  --types "cloudtrail"  -s http  -q 'Authorize*' -f eventName --invert",
      "subscribers": ["sensu-server"],
      "handlers": ["default"],
      "interval": 60,
      "occurrences": 2
    }
  }
}

With this I can query over any time frame, within a subset of event 'types', look for matches in any event field, and define warning and critical alert criteria for the results.

Now you can find out immediately when things are happening like non-approved accounts making changes, new IAM resources being created, activity in AWS regions you don't use, etc.

Closing Time

It can be exceptionally hard to move fast and 'embrace devops' and actually follow what you've documented in your organization's controls, policies, and procedures. If an audit is overly time-consuming, even more time is lost from 'real' work, and there's even more temptation to cut corners and skip steps. I'd argue that the only way to avoid this is to bake auditing into every tool, all along the way as you go. Like I said before, it doesn't need to be a huge monumental effort; just start now and build on it as you go.

Good luck and happy auditing!

Footnotes

(1) I only mentioned 3, but there are 5 "Trust Service Principles", as defined by the AICPA

December 23, 2015

Day 23 - This Is Why We Can't Have Nice Things

Written by: Tray Torrance (@torrancew)
Edited by: Tom Purl (@tompurl)

TLS Edition

Preface

A previous job gave me the unique experience of building out the infrastructure for a security-oriented startup, beginning completely from scratch. In addition to being generally novel, this gave me the opportunity to learn a tremendous amount about security best practices, and TLS in particular, during the leaks of documents exposing various mass surveillance programs and cutely-named vulnerabilities such as "Heartbleed". Among our lofty goals was a very strict expectation around what protocols, ciphers and key sizes were acceptable for SSL/TLS connections (for the rest of this article, I will simply refer to this as "TLS"). This story is based on my experience implementing those standards internally.

Disclaimer

TLS is a heavy subject, and it is very easy to feel overwhelmed when approaching it, especially for a newcomer. This is NOT a guide to simplify that problem. For that type of help, see the resources section at the bottom of this article. To keep the tone lighter, and hopefully more approachable for folks with less hands-on experience with these problems, I will use a single cipher, referred to by OpenSSL as ECDHE-RSA-AES256-GCM-SHA384, when demonstrating how a certain library represents its ciphers, along with amusing placeholders in my examples.

This will also (hopefully) help combat the fact that while this post may live for many years, TLS cipher recommendations can change. For the uninitiated, the above cipher string means:

  • Key Exchange: ECDHE with RSA keys

  • Encryption Algorithm: 256-bit AES in GCM mode

  • Signature: SHA384

These are the three components you need for a TLS-friendly cipher.

The Goal

With regards to TLS, our main objectives, both for ourselves and our customers, were:

  • Secure all internal and external communications with TLS

  • Enforce a mandatory key size

  • Enforce a consistent, strong set of ciphers across all TLS connections

For cipher selection, we effectively chose a subset of the Mozilla-recommended "modern" ciphers, for reasons that are out of scope for this post.

The Solution

With a modern configuration management solution (we used Puppet), this seems like a pretty trivial problem to solve. Indeed, we quickly threw together an NGINX template that enforced our preferences, and life was good.

...
ssl_ciphers ACIPHER_GOES_HERE:ANOTHER_GOES_HERE;
...

The Problem: EINCONSISTENT

So, that's it, right? Well, no. Anyone who's had to deal with these types of things at this level can tell you that sooner or later, something will come along that doesn't interoperate. In our case, eventually we added some software that more or less required the use of Apache instead of NGINX, and now we had a new config file in which to manage TLS ciphers.

The (Modified) Solution

Frustrated, we did what any good DevOps engineer does, and we abstracted the problem away. We declared a Puppet variable at "top scope", which let us access it from any other part of our codebase, and then referenced it from both our Apache and NGINX templates:

# manifests/site.pp
$ssl_ciphers = 'YOU_GET_ACIPHER:EVERYBODY_GETS_ACIPHER'

# modules/nginx/templates/vhost.conf.erb
...
ssl_ciphers <%= @ssl_ciphers %>;
...

# modules/apache/templates/vhost.conf.erb
...
SSLCipherSuite <%= @ssl_ciphers %>
...

The (Second) Problem: EUNSUPPORTED

As these things tend to go, after a few weeks (or maybe months, if we were lucky - I don't recall) of smooth sailing, another compatibility issue was introduced into our lives. This time, it was JRuby. For reasons that will be elaborated upon below, JRuby cannot use the OpenSSL library to provide its TLS support in the way that "normal" Ruby does. Instead, JRuby maintains a jopenssl library, whose purpose is to provide API-compatibility with Ruby's OpenSSL wrapper. Unfortunately, the library JRuby does use has a different notation for expressing TLS ciphers than OpenSSL, so jopenssl maintains a static mapping. Some of you may be groaning right about now, but wait - there's more!

In addition to not supporting some of the more modern ciphers we wanted to use (though it happily ignored them when specified, which was in this case helpful), feeding it malformed versions of the "magic" (aka ALL, LOW, ECDHE+RSA, etc) names supported by OpenSSL seemed to cause it to support any cipher that it understood - several of which are no longer secure enough for serious use.

This is why we can't have nice things.

The (Second) Solution

We had some pretty intelligent and talented folks attempt to patch this, but they were unsuccessful at unwinding the rather complicated build process by which JRuby tests and releases jopenssl. Ultimately, we decided that since the JRuby application was internal only, we could extend our policy for internal services to include the strongest two ciphers JRuby supported at the time. This meant adding another top-scope Puppet variable for use there:

# manifests/site.pp
...
$legacy_ssl_ciphers = "${ssl_ciphers}:JRUBY_WEAK_SAUCE"
...

And then, once more, referencing it in the proper template.

The (Third) Problem: ENOMENCLATURE

After another brief reprieve, along came a "proper" Java application. You may recall that I mentioned JRuby cannot use OpenSSL - well, this is because the JVM, being a cross-platform runtime, provides its own implementation of the TLS stack via a set of libraries referred to as JSSE (Java Secure Socket Extension). Now, for a brief digression.

OpenSSL cipher names are, you see, only barely based in reality. TLS cipher names are defined in RFCs, and our scapegoat cipher's official name, for example, is TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384. In my experience, OpenSSL almost always (annoyingly) ignores these names in favor of ones it makes up itself. JSSE, on the other hand (correctly, though I rarely use that word in the context of Java), uses the RFC names.

Still with me? Great. As you may have put together, adding a Java app meant that our existing variable was not going to be able to do what we needed on its own. Attempts to cobble together a programmatic mapping via files describing RFC names and tricks with OpenSSL syntax were fairly successful, but unidirectional, relatively brittle and prone to needing manual updates of the RFC name list in the future.
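
To make the mismatch concrete, here is a toy, hand-written sketch of that kind of one-way mapping (the real attempt generated it from files of RFC names); it only translates in one direction and goes stale as soon as the cipher list changes:

# Toy OpenSSL-name -> RFC-name mapping; deliberately incomplete.
OPENSSL_TO_RFC = {
    'ECDHE-RSA-AES256-GCM-SHA384': 'TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384',
    'ECDHE-RSA-AES128-GCM-SHA256': 'TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256',
}

def to_java_ciphers(openssl_cipher_string):
    """Translate an OpenSSL-style cipher string into JSSE (RFC) names,
    silently dropping anything we have no mapping for."""
    names = openssl_cipher_string.split(':')
    return [OPENSSL_TO_RFC[n] for n in names if n in OPENSSL_TO_RFC]

print(to_java_ciphers('ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256'))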

The (Third) Solution

As you may have guessed, it's certainly simple enough to do the following, using Java-compatible cipher names:

# manifests/site.pp
...
$java_ssl_ciphers  = 'IDK:WHO:I:EVEN:AM:ANYMORE'
...

And then use that variable as needed in the various Java templates, which are almost always XML.

The (Fourth) Problem: EERLANG

As I'm sure you've guessed by now, after a bit of smooth sailing, something else came along. This time, it was RabbitMQ, which is written in Erlang. Rabbit (and possibly other Erlang tools) supports SSL cipher configuration via an array of 3-element tuples. In RabbitMQ, our example cipher would be expressed as: {ecdhe_rsa,aes_256_gcm,sha384}. Now, let me first say that, academically, this is very clever. Practically, though, I want to start a pig farm.

The (Fourth) Solution

At this point, a hash, or really any data structure, was starting to look more appealing than a never-ending string of arbitrarily named global variables, so a refactor takes us to:

# manifests/site.pp
$ssl_ciphers = {
  'openssl' => 'MY:HANDS:ARE:TYPING:WORDS',
  'jruby'   => 'OH:MY:DEAR:WORD:WHY?!?',
  'java'    => 'PLEASE:MAKE:IT:STOP',
  'erlang'  => "[{screw,this,I},{am,going,home}]"
}

The Conclusion

Well, by now, you have possibly realized that everything is terrible and I need a drink (and perhaps a bit of therapy). While that is almost certainly true, it's worth noting a few things:

  • TLS, while important, can be very tedious/error-prone to deploy - this is why best practices are important!

  • A good config management solution is worth a thousand rants - our use of config management reduced this problem to figuring out the proper syntax for each new technology rather than reinventing the entire approach each time. This is more valuable than I can possibly express, and can be achieved with basically any config management framework available.

While we can't currently have nice things (and TLS is not the only reason), tools like Puppet (and its many alternatives, such as Chef, Ansible or Salt) can let us start thinking about a world where we can. With enough traction, maybe we'll get there one day. For now, I'll be off to find that drink.

References

Acknowledgements

I'd like to thank the following people for their help with this, regardless of whether they realized or volunteered to do so.

  • My wonderful partner Mandy, for her endless support (and proofing early drafts of this)
  • My SysAdvent editor, Tom, for being flexible and thorough
  • My former colleague, William, who inadvertently mentored much of my TLS education

December 22, 2015

Day 22 - Simplicity in Complex Systems

Written by: Joshua Zimmerman (@thejewberwocky)
Edited by: Michelle Carroll (@miiiiiche)

Simplicity as a reaction to complexity

Complexity is inherent in IT. We work with tools and technology that are objectively complex, in organizations made up of complex structures of complex people, and can only achieve the goals of the organization with complex interactions. We build this complexity into the systems that we architect. We justify most complexity, believing that it may be needed to accomplish our goals. We disregard this complexity until it makes accomplishing our goals sufficiently difficult. We talk about complexity a lot.

However, we often don’t talk about simplicity until the complexity in our systems causes problems that frustrate us or block our goals. Engineers often talk about simplicity only as it relates to complexity. We’ll use simplicity to sell someone on a product or idea, but we only adopt it because the tool lessens our complexity. We lack a common understanding of simplicity, and forget to talk about simplicity on its own merits.

Let’s change that.

Before diving into a real discussion of simplicity and complexity, I need to reiterate the tools we use in the IT landscape are inherently complex, but not as complex as the systems of people that create and maintain these technologies. There is no true dichotomy between simple and complex: in the simplest cases, things exist on a spectrum of the two. Therefore, it’s impossible to use basic definitions of simplicity and complexity. If we were constrained to those definitions, we would never be able to call the tools we use in IT simple, because tools like hammers and levers exist.

This leaves us with the question, how do we define simplicity for our complex technological and cultural systems? By discussing simplicity in a more abstract way, I hope it becomes easier to apply simplicity to our applications, systems, and the organizations and communities that support them. To frame our conversation, let’s explore John Maeda’s Laws of Simplicity.

The Laws of Simplicity and modern IT

Laws of Simplicity was published in 2006, when Maeda was still a professor at the MIT Media Lab. Simplicity was already a trend in the design community. After a couple of years of iteration, the iPod was in what we now call its "classic" design. This simple interface design stayed largely the same for a decade. Google, with its minimalist front page, had emerged as the dominant search engine, and the term “googling” was already proof of its popularity. Maeda had been working on distilling his ideas about simplicity in design into laws that would help people discuss, understand, and achieve the concept, in design and beyond. He believed these laws were applicable outside of the realms of design. The result of that effort is Laws of Simplicity. Maeda treats the book as the start of an ongoing discussion — one need not follow all the laws, nor consider them a complete set.

Law 1 - Reduce: The simplest way to achieve simplicity is through thoughtful reduction.

The first step in simplifying something is to remove what is not necessary. What do we do when we have gotten rid of all the obviously unnecessary or unused pieces, and still have something complex? There are other ways to reduce a system. Maeda suggests a few, with the acronym SHE: Shrink, Hide, Embody.

We have a strong reaction when something is physically or conceptually large. When we shrink things, they seem simpler. A desktop may seem more complex than a tablet, despite having very similar underlying technology. This type of comparison works with more abstract concepts. We use terms like "lightweight" and “portable” to describe containers, and before Docker became popular, we used those same terms to describe VMs. We conflate simplicity and smallness; we use the word “lightweight” even though a container has no mass. By shrinking the conceptual size of a system, it seems simpler.

Hiding the complexity serves a similar purpose as shrinking. We see this often in IT, through menus in a UI, or interacting with a service through the use of an API. When we implement these things, we usually are not removing any of the complexity of the system. However, we simplify working with that system by only calling upon the complexity when needed.

However, you can’t just reduce, shrink, and hide your systems and applications. The final product must embody value and quality equal to what you are removing. When organizations first implement a central logging system, some people will still want to go to the original host and grep the log file. They see more value in the old way of doing things than in the new system. You need to communicate the value of the change, or risk sprawling complexity.

Law 2 - Organize: Organization makes a system of many appear fewer.

Human beings tend towards organization as a way to simplify things, and it’s easy to see this tendency in our attempts to organize our applications and technical systems. We organize sets of functions into libraries. Service-oriented architecture and microservices are employed to break monoliths into smaller, more sensible buckets. Service discovery tools were created to easily find, organize, and manage like systems.

As an industry, we’re actually pretty good about reorganizing systems when the complexity causes us pain. We sometimes forget that organization is primarily about grouping. The goal is to simplify the set of things you’re organizing by combining things into related, sensible categories. If you have too many categories, you haven’t adequately simplified. If you end up with too few categories, you may have oversimplified and added complexity to the system.

Maeda relates organization to Gestalt psychology. Our brains allow us to extrapolate and form patterns from partial evidence. If you see three lines connected at 90 degree angles, your brain will fill in the fourth line to create a square. Maeda argues that you can create Gestalt groupings to increase simplicity, but you may make that group more abstract. For example, authentication and authorization solutions are often coupled together. This coupling can allow you to store information on users in one place across many applications that need to consume it. However, the individual applications lose the ability to collect user information specific to them.

Law 3 - Time: Savings in time feel like simplicity.

Maeda says that waiting for things makes life feel needlessly complex. It’s hard to disagree. Spending time in meetings where you learn nothing new and make no decisions feels like added complexity in your day. An alert going off for something you can’t control feels like an unnecessary distraction. Time that you spend babysitting a process to completion in front of a screen feels wasted. A developer waiting days — or worse, weeks — for an environment reveals needless complexity in your organization. Automation has been making the lives of IT professionals easier for decades. Even if it did not produce more consistent results, we would still implement it because it reduces the time and cognitive load needed to do repetitive tasks.

Automation can't reduce all of our waiting time, but we can make that waiting time more tolerable. For example, we allow CI to run our tests for us so that we can do other things while they run. We can find ways to imbue the waiting time with more value so it feels less wasted. Many people feel that writing documentation when it’s not their primary role is a waste of their time. You can retroactively ascribe more value to the time people spend writing by making sure to thank the writer when a document they wrote is useful to you. That impact makes their next writing session feel more valuable.

There is another component to the law of time that is more humanitarian in nature: You should be doing what you can to save time for other people. You can simplify the lives of others by saving them time. Some of the most common ways to do this are:

  • Automate your systems
  • Help others automate the systems that affect them
  • Ask people for help finding bottlenecks
  • Create and send out meeting agendas in advance, and keep the meeting on topic
  • Make sure your meetings do not go over the allotted time

When you work towards saving time for others, they have more time and mental energy, and are more likely to want to help you reduce the wasted time in your day.

Law 4 - Learn: Knowledge makes everything simpler.

We have all been in the situation where we learn how something works, and say, "Oh, that’s so simple! Why didn’t I understand it before?" The trick is it probably wasn’t that simple— you’ve just gained the requisite knowledge to understand the subject. Even more confusing, once you understand something, it is often difficult to view its complexity from the eyes of a beginner. Pamela Vickers’ talk, “Crossing The Canyon Of Cognizance: A Shared Adventure,” is a great reminder of what getting started in our industry can be like.

Nothing adds complexity to an application or system quite like somebody who does not understand it, yet has been tasked with working on it. The fault does not lie with them! Knowledge and learning are everyone’s responsibility. When we can’t make a system simple enough for people to intuit, we must take responsibility for explaining it. It is our responsibility to become better teachers.

Maeda, who was a professor at the time that he wrote The Laws of Simplicity, stresses learning and teaching the basics. In tech, we have a tendency to undervalue our teaching processes by trying to be efficient and quick. Whether you are teaching people new to the profession, training juniors, or onboarding new employees, your focus should be on helping someone else become a valuable member of your organization or community. On an organizational level, you should be iterating on your processes and learning how to help people build these foundational skills.

Maeda also stresses repetition. Most people don’t fully understand something after doing it only once. That’s why math teachers have students do similar problems over and over. When learning something new — whether a tool, a language or the way applications and systems are architected at a new job — be prepared to not understand things immediately, and take the time to go over things again and again until you’ve learned them. When you are teaching someone, expect to repeat yourself a lot. Have patience. Your student may not ask for help when they need it if you appear frustrated by repeating information.

Maeda recommends we avoid creating desperation for the learner. In the tech industry, we are awful at this. Organizations throw new employees into the on-call rotations before they feel confident in their abilities to fix issues. People tell new folks to RTFM, regardless of the quality of the documentation. We have created a culture that encourages impostor syndrome. People fear being outed as an impostor, as someone who doesn’t really know enough to belong, and may not ask questions for fear of being found out. When we create an environment in which someone is both desperate and without knowledge, we intimidate that person and make it more difficult for them to ask questions and to learn. Consider what you can do to create a more supportive learning environment. Run game days with new employees before putting them on call. Read the manual with the person you’re training, and if the information is difficult to find or outdated, teach them and fix the documentation. Understand impostor syndrome and empathize with those that have it.

Law 5 - Differences: Simplicity and complexity need each other.

I started this article lamenting how we only talk about simplicity in the context of complexity. The concepts of simplicity and complexity have a… complex relationship. It is easy to think of them as two ends of a spectrum, but something can be simple in some dimensions and complex in others. It is rare that any system is purely simple or complex, and you should design your systems with that in mind.

We need to be cognizant of the symbiotic relationship between the two concepts. Sometimes, to make something more simple, you must make something else more complex. Making a change to simplify the UI may make the codebase more complex, as you have to intuit more about what the user wants. You may wish to simplify that codebase by using a library for that language, but that library may have dependencies, and all of those libraries need to be deployed and maintained. Microservices and service-oriented architecture make every individual application simpler and easier to maintain, but create more complexity when it comes to managing the interactions between services and the people who maintain and manage them.

Often, we simplify things for ourselves at the cost of complicating things for others. We need to have empathy for those to whom we are pushing this complexity. How you communicate the transfer of complexity is important, and predicts how the application functions in the long term. You should feel comfortable asking and discussing questions like the following:

  • Does the other person or team have the time or knowledge to deal with the additional complexity?
  • Can you help them learn?
  • Can you help simplify the complexity through any other means?
  • Can you take on additional complexity elsewhere to save them the time required to take on this new complexity?

One of the most important things that Maeda says about this law is the reminder that we do desire complexity from time to time. If our jobs were always simple, many of us would quickly grow bored. Too much simplicity is boring and may not help us accomplish our goals, yet too much complexity can waste time and cause frustration. Balance is important.

Law 6 - Context: What lies in the periphery of simplicity is definitely not simple.

The law of context reminds us that focusing on what we feel is relevant allows us to simplify, but there could be important things in the periphery that should not be removed. We tend to approach simplification of an existing system with all of the hindsight and correspondence bias that we can muster, usually because the system is causing us frustration or wasting our time. We think, "Clearly what the people before me did is wrong, and they did it because they were lazy or bad at their job!" Gaining the knowledge of why a system is the way it is, and how it came to be that way can be as important to simplifying the system as learning it in the first place.

Maeda points out that the absence of something can provide context and meaning. Google.com is a good example. They made the active choice to make their website primarily whitespace, to focus the user on the few things they put on the page. When a decision was made not to do something, it is always worth finding out why.

Law 7 - Emotion: More emotions are better than fewer.

Emotions are complex. They are one of the primary ways that people interact with the world around them. They don’t necessarily follow logic or reason. Trying to simplify or ignore emotions — whether your own or someone else’s — is an anti-pattern. Rather, endeavor to understand and accept them.

Empathy and caring for the individual has been a theme in software development for over a decade. It is a fundamental idea of DevOps. The first value listed in the Agile Manifesto is, "Individuals and interactions over processes and tools." You, and the individuals in your organization, are more important than simplicity, and more important than software. Disregarding how somebody feels will almost always result in a worse product, because you’ll never get a chance to include their knowledge and opinions in the process.

Empathy is one of the most powerful tools we have. It allows us to share and understand emotions. This understanding simplifies our interactions with those around us, and improves how we communicate, enabling us to create better products. When working in a complex system of people trying to produce a shared product, it is okay to sacrifice the simplicity you see in a product for the emotions of others.

Law 8 - Trust: In simplicity we trust.

It is easy to trust something that is simple. Take an application like Etsy’s Deployinator. The UI of this app basically amounts to: click a button corresponding to the environment you would like to deploy to and watch it go. It’s immediately clear how to choose the application you want to deploy, and you can trust that it will just work. It goes the other way as well: Imagine asking someone who had never seen Nagios before to write a check for something without any documentation. Are they going to trust their solution? People will trust your output if you can make the process simple and approachable — whether it’s building an environment, building a tool, designing an architecture, or creating a process.

There are other ways to build in simplicity to facilitate this trust. When you log in to Slack for the first time, you see pop-ups that explain all the functionality that you may wish to know about. They build knowledge that makes you confident you are using the tool correctly. You can reduce the complexity needed to interface with your system. Luke Francl gave a great talk at DevOps Days Minneapolis about helping developers monitor their own applications. They enabled their developers to write Nagios checks in the same language they were writing their application and in a manner that was simple and straightforward to them. Building simplicity into your systems will help people trust them.

Trust also helps to simplify things. Trust allows us to use APIs to applications. It allows us to download and use modules to extend languages and applications. It allows us to use open source software. We make the active choice to use these things because we trust them to work and continue working, and we do it because they simplify the way we write our software and create our systems. We also make the choice to trust the people that we work with. Trusting others on your team and other teams in your organization saves you time, and simplifies what you have to do. When you trust others, you can ask for help when you lack the knowledge to do things, you feel enabled to share your emotions, you have the ability to provide context to your problems and explore the differences of complexity and simplicity in your system together. When others trust you, you feel empowered to reduce and reorganize the complexity you know about as needed and you save time because they are enabling you to do what you think is best. Trust simplifies the way people work together and communicate in a complex organization. By Conway’s law, the simplifications that you see in your organization should become apparent by simplifications in the systems that you create.

Law 9 - Failure: Some things can never be made simple.

Not everything can be simplified. You should expect to fail at simplifying things from time to time. Even if you do succeed in simplifying something, as was talked about in Law 5, you may just create complexity elsewhere. Due to the complexity of the organizations that many of us work in, it is always important to remember that success for some can be failure for others.

Tech organizations have been learning to talk about failure in more constructive ways in recent years. Borrowing from other fields such as safety engineering, we’ve learned how to learn from and be accepting of failure. If you fail at creating the simplicity that you set out to create, as long as you learn from the process, you succeed in creating the simplicity that comes from having a better understanding of your systems and organization, and your next endeavors at creating simplicity will be more informed.

Law 10 - The One: Simplicity is about subtracting the obvious, and adding the meaningful.

The final law needs no significant explanation, but it deserves repeating. When in doubt, simplicity is about subtracting the obvious, and adding the meaningful. You know where the pain points caused by complexity are in your applications, in your operational infrastructure and in your organizational culture. Remove the things that are clearly not needed and work at improving the areas that would benefit your users, your colleagues and yourself.

The Three Keys

After organizing his thoughts into the ten laws, Maeda was left with three ideas that did not fit neatly into the laws and were all related to technology. These ideas became the three keys.

Key 1 - Away: More seems like less by simply moving it far away.

While it is clear that simplifying things sometimes just moves complexity around, sometimes you can break that complexity away from your systems. SaaS, PaaS, and IaaS solutions exist in many respects to take complexity away from our organizations.

Of course, there are many things to consider with these services. We are removing something, so we should be thoughtful about if we are adding value equal to what we remove. For an organization with a large, seasoned Ops team with automation in place, moving to a PaaS may not provide enough value. Does this service save us time? What do we need to learn to use this service? How do we and our colleagues feel about this change? Do we trust the vendor?

Key 2 - Open: Openness simplifies complexity.

Openness runs through all of the laws. Being open enables us to share in easier, more meaningful ways. Through sharing knowledge and emotions, we learn, reduce time spent doing the same thing multiple times, and we can build trust.

Maeda uses open source as his primary example, showing that sharing our code across our communities simplifies our lives. More interesting to me are the tools and ideas that we are building around openness. Chatops simplifies the communication in your organization by using chatrooms to manage your automated tools. It helps build and retain context for events going on in your system. But it is not enough to just leverage the automation. Your organization needs to welcome a wider group of people to more conversations. Most of your chatrooms should be open to anyone in the organization. Direct messages should be discouraged unless necessary.

Key 3 - Power: Use less, gain more.

Maeda uses this key to talk about our reliance on electricity and our devices. While I think it is important to discuss our industry’s reliance on massive amounts of electricity, this article is not the place. However, I’d like us to focus on the time we spend connected to our devices.

We know that it’s common for people in tech to work more than 40 hours a week, to not take (or receive) much vacation, to have on-call rotations that border on unhealthy. We force ourselves to be available by being connected to our devices. However, research and experimentation have shown that these are all anti-patterns. The 40-hour work week was based on research finding the maximum amount of time someone could work before the quality of their work decreased. Organizations have begun experimenting with minimum vacation policies that by all accounts are healthier for both the organization and employees. By disconnecting our employees from their devices appropriately, we improve the organization.

By creating simpler systems, we lessen the load on people in our organizations. Planning to not have people, or a specific person, around forces us to simplify our systems. We focus on sharing knowledge so more people can fix an issue. We learn to trust a wider range of our colleagues. The emotions people have toward work are more positive. We learn to better cope with failure. We gain our personal time back, making the time we spend at work more simple.

Conclusion

You’ve likely noticed that I haven’t given a lot of specific advice on how to simplify a complex system. Every server, every application, every organization is a unique system of extraordinary complexity. Each deserves its own conversation on simplicity. We need to build up our ability to discuss and think critically about simplicity. We should prioritize simplicity in planning, instead of accidentally creating complexity.

You may have also noticed that the laws reflect more upon the people who create and maintain these systems than the systems themselves. Culture and communication are inextricably linked to the things an organization creates. When looking to simplify something, consider whether the complexity is a reflection of the culture of your organization. Solving cultural issues in your organization will decrease unnecessary complexity.

Thanks

A big thanks to the organizers of SysAdvent for running this awesome event, my editor Michelle Carroll for being a wonderful collaborator on this article and making it far more concise and coherent, John Maeda for his book that allowed me to develop these ideas and my older brother for telling me to read it in the first place.

December 21, 2015

Day 21 - Dynamically autoscaling a jenkins slave pool on EC2

Written by: Ben Potts (@benwtr)
Edited by: Adam Compton (@comptona)

At $lastjob, I configured a cluster of Jenkins slaves on EC2 to dynamically autoscale to meet demand during the working day, while scaling down at night and over weekends to cut down on costs.

There is a plugin that has very similar functionality, but I wasn't aware of its existence at the time and it's not as flexible or as much fun.

There were a few fun and interesting ingredients in this recipe:

  • Generating AMIs for the slaves and rolling out slave instances in an autoscaling group (ASG) with Atlas, Packer, Terraform, and Puppet
  • Setting up Netflix-Skunkworks/dynaslave-plugin so slaves register themselves with the Jenkins master
  • After regenerating slave AMIs with new configuration, rolling through the ASG so slaves terminate and get replaced with the new configuration while avoiding terminating Jenkins jobs that may be running on the slaves
  • Pulling metrics from Jenkins about build executors and job queuing, then pushing them to CloudWatch
  • Using Autoscale Lifecycle Hooks to make Autoscaling wait for any Jenkins jobs to finish running on a terminating slave, instead of terminating it immediately
  • Tuning the autoscaling to ensure it's cost effective

To anyone unfamiliar with how EC2 AutoScaling works, here's a rough and possibly oversimplified explanation: An AutoScaling Group manages a pool of instances on EC2. It will replace unhealthy instances, distribute instances evenly across availability zones (AZ), allow you to use spot instances and of course dynamically grow or shrink the pool. Some of the key configuration parameters on an ASG are: MinSize (minimum instances in the group), MaxSize (maximum instances in the group), DesiredCapacity (how many instances should be running in the group, this is an optional parameter), LaunchConfigurationName (the AMI and configuration of the instances in the pool). These parameters can be changed at any time and AWS will adjust the number of instances appropriately. Dynamic Autoscaling is the automatic manipulation of these parameters by CloudWatch alarms and Autoscaling policies triggered by CloudWatch metrics.

And for anyone unfamiliar with Jenkins and Jenkins slaves: slaves are simply servers that the master Jenkins instance can execute jobs on. Slaves have "labels", which are just arbitrary user-defined tags; jobs can be configured to "execute only on slaves with label foo".

Setting the Scene

The Jenkins cluster I worked on needed at most about 25 slaves running to avoid queuing and waiting. There were several types of slaves with different labels for different uses, but ~20 of that 25 ran most of the jobs. This is the pool that needed autoscaling; we'll say these slaves have the label workhorse. One thing special about these slaves is that they are small instances and they are configured to have only a single build executor each. That is, they only run one job at a time. This is a bit unusual but it simplifies some of these problems so I'll use it in any examples and explanations here.

We had Puppet code for installing and configuring a system to be used as a slave but needed to automate the rest of the process: bringing up an EC2 instance, running [masterless] puppet and some shell scripts on it, generating the AMI, creating a new launch configuration that points to the AMI, updating the Autoscaling group to use this new launch configuration for new instances, then finally terminating all the running instances so that they get replaced by new instances.

Building the Jenkins slaves

Generating an AMI by running Puppet on an EC2 instance is basically a textbook use case for Packer, automating this step was a snap.

Creating a launch configuration and updating the ASG to point at it? Hi Terraform! Once I had the create_before_destroy bits figured out, this was also easy.

I used Atlas to glue it together and fully automate the pipeline for delivering new Jenkins slave AMIs. I could push a button and Atlas would run Packer to generate an AMI, push metadata about the AMI to Atlas's artifact store then kick a terraform plan run to handle the launch config and ASG, using the latest AMI ID fetched from the artifact store. A notification would be sent to our chat asking for the output of the plan to be acknowledged and applied.

Registering newly-built slaves

Not going to go into too much detail here, but an important piece of this puzzle is a mechanism to allow slaves to register themselves with the Jenkins master. We started off using the Swarm plugin but had some problems and switched to Netflix-Skunkworks/dynaslave-plugin. Batteries aren't really included with this plugin, and the scripts that come with it are almost pseudocode: they look like shell scripts, but they are non-functional examples and maybe even a bit misleading. It was smooth sailing once I had the scripts squared away, though. The slaves had a cron job that would check whether they were registered with the master and, if not, register themselves.

The next step was getting metrics into CloudWatch that we could use to trigger scaling events. I whipped up a script that polled the Jenkins build queue, counted total/busy/idle executors for each label, and pushed the resulting counts into custom CloudWatch metrics. I also enabled CloudWatch metrics on the ASG so I could track Total/Max/Min/Desired instances in the group. Since CloudWatch only stores metrics at a minimum resolution of 1 minute, the script can run from cron without missing anything.
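
The real script isn't reproduced here, but a rough sketch of the idea looks something like this; the Jenkins URL, label, and CloudWatch namespace are placeholders, and it leans on the one-executor-per-slave setup described above:

import boto3
import requests

JENKINS = 'https://jenkins.example.com'  # placeholder
LABEL = 'workhorse'

def executor_counts(label):
    """Count busy/idle single-executor slaves carrying a label, plus queued jobs."""
    computers = requests.get(JENKINS + '/computer/api/json').json()['computer']
    nodes = [c for c in computers
             if not c['offline']
             and label in [l['name'] for l in c.get('assignedLabels', [])]]
    idle = sum(1 for c in nodes if c['idle'])
    busy = len(nodes) - idle
    # Counts every queued item, not just ones waiting for this label.
    queued = len(requests.get(JENKINS + '/queue/api/json').json()['items'])
    return busy, idle, queued

def push(busy, idle, queued):
    dims = [{'Name': 'Label', 'Value': LABEL}]
    boto3.client('cloudwatch').put_metric_data(
        Namespace='Jenkins',
        MetricData=[
            {'MetricName': 'BusyExecutors', 'Value': busy, 'Unit': 'Count', 'Dimensions': dims},
            {'MetricName': 'IdleExecutors', 'Value': idle, 'Unit': 'Count', 'Dimensions': dims},
            {'MetricName': 'QueuedJobs', 'Value': queued, 'Unit': 'Count', 'Dimensions': dims},
        ])

if __name__ == '__main__':
    push(*executor_counts(LABEL))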

Upgrading and tearing down slaves

When we wanted to upgrade the Jenkins slaves, we needed to kill off all the instances in the ASG so they would be replaced by new instances that used the new AMI. I came up with kind of a stupid party trick involving queuing, which took advantage of these slaves happening to have only a single build executor slot each. This needs the NodeLabel and Matrix Project plugins installed. I created a "suicide" Matrix job that when run, executes on every slave matching the workhorse label; it was not much more than a shell build step that executes:

aws ec2 terminate-instances --instance-ids $(ec2metadata --instance-id)

Because these slaves do not execute jobs in parallel, it's safe to dispatch this 'suicide' job.

If the slave instances had been configured with >1 build executor, this job could make a call to the Jenkins API on the master asking it to set the slave's build executors to 1 so we could still take advantage of this pattern. If multiple jobs are already running on the slave, they will continue to run in parallel until the queue is empty, and only then will the suicide job run safely. In practice, we should probably also update the slave's label to prevent a race condition where a terminating slave begins running a new job.

At this point I had just about everything in place to dynamically autoscale, but there was one tricky problem left: when scaling in/down, AWS doesn't really give you any control over which instances in the ASG will be terminated. You can associate a policy that will make it favor the instance closest to the billing hour, the one with the oldest launch configuration, the best one to terminate to balance instances across AZs, etc. So, how do we stop it from killing a slave that's still running a job?

Autoscaling lifecycle hooks are part of the solution. They allow you to register hooks to run at instance startup or termination. With the termination hook registered, here is what happens when autoscaling terminates an instance:

  1. target instance state is set to terminate:pending but does not shut down
  2. a Simple Notification Service (SNS) notification is sent, whose payload contains some data like the instance id that will be terminated
  3. the instance will keep running until either a timeout expires or its state is changed to terminate:continue, at which point it will actually terminate.

So, I registered the hook, set a timeout to around the maximum time a job on one of these slaves should ever take to run, subscribed a Simple Queue Service (SQS) queue to receive these SNS messages, and then I wrote a script. The script runs on the Jenkins master and polls the SQS queue. When it receives a notification, it does a variation of the suicide job: it changes the target slave's label so it will stop accepting new jobs, sets the number of build executors to 1 if it is not already, then queues a job on the slave using the NodeLabel plugin which changes the slave's state to terminate:continue, and the slave gets terminated safely.
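
A skeleton of that script might look something like this (boto3 for the AWS side; the Jenkins-facing steps are reduced to a placeholder function since they depend on how you talk to your master, and the queue URL is made up):

#!/usr/bin/env python
"""Drain and release Jenkins slaves that autoscaling wants to terminate."""
import json
import boto3

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jenkins-lifecycle'

def drain_and_release(instance_id):
    # Placeholder: relabel the slave, force its executors to 1, then queue the
    # NodeLabel-pinned job that flips the lifecycle state to terminate:continue.
    pass

def main():
    sqs = boto3.client('sqs')
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            # SNS wraps the lifecycle notification in its own JSON envelope
            notification = json.loads(json.loads(msg['Body'])['Message'])
            if 'EC2InstanceId' in notification:
                drain_and_release(notification['EC2InstanceId'])
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg['ReceiptHandle'])

if __name__ == '__main__':
    main()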

Finally, we can dynamically autoscale this thing!

Tuning autoscaling

Getting dynamic autoscaling right is a bit of a nuanced art. Amazon bills by the instance hour and still charges you for a full hour if your instance runs for one minute, so you can get burned badly and end up with a huge AWS bill if you're not careful. You always want to scale up fast but scale down slow. What I found to be effective in this situation was scaling up by increasing DesiredInstances whenever there are jobs queued, and beginning to scale in once IdleExecutors has been >3 for 50 minutes.
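
Expressed as scaling policies and CloudWatch alarms, that tuning looks roughly like this (ASG, policy, and metric names are illustrative, and the policy ARNs returned by the first two commands go into --alarm-actions):

aws autoscaling put-scaling-policy --auto-scaling-group-name jenkins-workhorse \
  --policy-name scale-out --adjustment-type ChangeInCapacity --scaling-adjustment 2

aws autoscaling put-scaling-policy --auto-scaling-group-name jenkins-workhorse \
  --policy-name scale-in --adjustment-type ChangeInCapacity --scaling-adjustment=-1

# Scale out as soon as anything is queued...
aws cloudwatch put-metric-alarm --alarm-name jenkins-queue-backlog \
  --namespace Jenkins --metric-name QueuedBuilds --statistic Maximum \
  --period 60 --evaluation-periods 1 --threshold 0 \
  --comparison-operator GreaterThanThreshold --alarm-actions <scale-out-policy-arn>

# ...and scale in only after executors have sat idle for a long while.
aws cloudwatch put-metric-alarm --alarm-name jenkins-idle-executors \
  --namespace Jenkins --metric-name IdleExecutors --statistic Minimum \
  --period 60 --evaluation-periods 50 --threshold 3 \
  --comparison-operator GreaterThanThreshold --alarm-actions <scale-in-policy-arn>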

I set up a CloudWatch graph that had all the metrics for the ASG and the Jenkins slaves in one place. Another pleasant side effect of having 1 executor per slave was the effect it had on this graph: it was very easy to visualize and reason about what was going on. For example, I could spot (and aim to minimize) fluctuations in DesiredInstances that added up to waste, and I could see that someone had started up a slave outside of autoscaling because TotalExecutors was 1 higher than DesiredInstances.

Conclusion and lessons learned

Getting these Jenkins slaves to autoscale was quite a bit of work, but it was also a great excuse to learn and experiment. It was a success and saved money immediately: it cut the instance-hours on this group of servers in half and settled the debate about the right statically-set number of slaves to use. It was also surprisingly robust despite the number of moving parts, and the repercussions of any part of this autoscaling scheme being broken weren't terrible.

I learned a ton while working on this and will surely be applying that knowledge to other projects in the future. For example, the next time I'm designing a service that consumes from a message queue I'll add a feature: a "suicide" message can be queued so the service can deal with lifecycle hooks. I've already built other "immutable infrastructure" delivery pipelines for other projects with what I learned.

December 20, 2015

Day 20 - Why NetDevOps?

Written by: Leslie Carr (@lesliegeek)
Edited by: Brian Henerey (@bhenerey)

We’ve read a lot of great ideas so far. However one crucial piece is missing in these conversations - the network!

Automation has moved from a new idea to something we take for granted in the new DevOps paradigm. For too long the network has been left out of this awesome paradise, stuck in the dark days of manual configuration. Some engineers think this is because network engineers simply don’t care about modernizing their ways, but that is not true for many! The truth is that the tools to interact with switches and routers are still in their infancy. Cisco’s OnePK, which created an API for Cisco switches and routers, was only released to the general public in early 2014! CFEngine began in 1993! The systems world has had more than a 20-year head start on the network world! There has been much debate over why this lag has occurred, but I am not going to jump into the middle of that. Today I am going to tell you how to bridge this gap.

Collaboration

The fact that network teams deal with the world in a different manner than systems teams is not only a technological problem. These differences prevent easy collaboration. Systems and network teams are often divided, with ticket walls between silos. In many companies this causes friction or even hostility between teams, instead of promoting empathy and collaboration. As an example, take new system setup. The network team has to configure ports and assign VLANs. The systems team has to turn up the machine and get it running in its proper roles. In a better world, one team could just send a pull request over to the other and it would be a quick review. In a normal company, a ticket is opened. The network team has more urgent and interesting tasks to do than port configuration. The systems team is left waiting, feeling helpless and frustrated. Resentment builds up instead of harmony and cooperation!

Software

Rancid has been the one and only way to deal with gathering network configurations. It’s not a bad tool - it works with almost everything. The problem is that Rancid deals with the world through the lens of an older paradigm. Rancid logs into network gear with passwords stored in plain text and then uses expect scripts to basically screen scrape data and push it into a CVS or SVN repository.

The network teams do not have to reinvent the wheel and can use the same software that the systems teams use. The popular configuration tools, like Chef, Puppet, Ansible, and Salt, are starting to interact with switches and routers. The bad news is that these tools often only configure a subset of the possible configuration options and only work on a few models of switches. While we wait for all of our gear to be easily supported, the network world needs to take some intermediate steps to move towards the standard future.

The DevOps movement has evolved the systems world to one where infrastructure is defined by code, with all of the benefits involved. The network world needs to join up, with NetDevOps.

Templatize configurations

Many switches and routers still don’t support automation tools. Also, most switches share the same basic configuration, with ports and IP addresses as the variables that change. We can use our automation tools and templates to automatically generate those configurations.
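
Even before your gear speaks to Chef or Ansible directly, a few lines of templating buy you consistent, reviewable configs. A tiny sketch with Jinja2 (values and IOS-style syntax invented for illustration):

from jinja2 import Template

# Render a per-port access configuration from a handful of variables.
port_template = Template("""\
interface {{ port }}
 description {{ description }}
 switchport mode access
 switchport access vlan {{ vlan }}
""")

print(port_template.render(port="GigabitEthernet0/12",
                           description="web01 uplink",
                           vlan=100))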

Tear down the ticket wall

Now that your configurations are code, you can use git pull requests for all of those little things you used to ticket. Breaking down the ticket wall in between team silos improves everyone’s productivity and empathy. Spotify calls this an internal open source model and uses it to help make their company so successful.

Code Reviews

Now that all changes are made via pull requests, you can use tooling to enforce that requirement and require a code review by a second person. If you are not using version control software to centralize and change your configurations, code reviews are only possible by someone looking over another person’s shoulder before they hit enter. Continuous integration tools like Jenkins and Travis CI aren’t just for code. You can write tests to check syntax or catch spelling errors! Typos have brought down most major websites at some point in time (http://www.cnet.com/news/widespread-google-outages-rattle-users/). Computers are so much better at catching these mistakes than humans.

Testing

Virtualization and test environments aren’t just for servers any more. And you don’t need to buy an entire second set of hardware just to test out your changes. GNS3 is a great, open source tool that has VM support for most major vendor platforms. You can make a model of your network and test out that BGP change before your 3am maintenance window. No more “I’m pretty sure this will work in production!”

The future starts with you

If you want your networking equipment to support automation clients directly, the best thing you can do is to pressure your sales people. Until about a month ago I was working for a vendor and a large amount of feature prioritization is based on customer demand and feedback (these are for-profit corporations, after all!). We’re not in the old fashioned world where Cisco can dictate how your network is run. There are so many choices nowadays. If we demand that the network vendors help us to run the networks just like we run our servers, they will listen because we can vote with our wallets.

Hopefully soon, we’ll tell the stories of how we used to login to a router via telnet and type commands directly on the command line like a ghost story, to scare the new junior admins around the campfire!

December 19, 2015

Day 19 - HTTP/2

Written by: Anthony Elizondo (@complex)
Edited by: Tobias Brunner (@tobruzh)

The World Wide Web. You use it everyday. It might even be as old as you are (born in March of 1989). And it all runs over Hypertext Transfer Protocol.

The first draft of the HTTP specification was pretty simple. Just ~650 words, ~25 lines. Since then we've improved on it with HTTP 1.0 and 1.1. It can stream video and retrieve your email. We've wrapped some security around it with SSL and now TLS. And now we're moving on to the largest change in the protocol in its history: HTTP/2.

Oooh shiny

In the last 26 years HTTP has enveloped the world. Other protocols such as gopher and FTP have fallen in prominence, and seemingly everything runs over HTTP. HTTP APIs are preferred over all other transports, the Web is arguably the world's largest application platform, and web browsers are the planet's most widely distributed runtime.

HTTP/2 was finalized in May of 2015 in RFC 7540 after ~3 years of work. It builds on an earlier protocol from Google called SPDY. SPDY is due to be deprecated in January 2016.

Changes

The main drive in development of HTTP/2 was performance. With so much of the Internet consisting of Web traffic, optimizations in speed and efficiency would have large payoffs. In addition, end users are shifting to a more mobile (or mobile-only) world of increased latency. Plus, web pages are becoming more and more complex, in both size and composition. Security was also an area of focus, with internet privacy at the forefront of many web users’ minds.

HTTP/2 uses a single persistent connection

Over this single connection, requests and replies are multiplexed. Like SSH multiplexing, this means less time spent doing connection setup/teardown (for instance, SSL handshakes are minimized). It also avoids HTTP-level "head of line" blocking, which means one slow request will no longer delay subsequent requests.

HTTP/2 is a binary format. This means it can be extremely fast (think Protocol Buffers instead of parsing XML or JSON). The downside is that we lose the ability to debug connections with "telnet host 80" and "GET /". But hey, with encryption everywhere we couldn't do that anyway, right? If you do need to look at HTTP/2 "on the wire", Wireshark already has a plugin. And for debugging, curl 7.43.0 and higher has HTTP/2 support built in.
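
For example, a quick check from the command line looks something like this (assuming your curl is new enough and was built with HTTP/2 support); the status line of the response shows the negotiated protocol:

curl -sI --http2 https://www.google.com | head -1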

TLS Required

Speaking of security, one of the contentious points during the protocol's development was the requirement of transport-level encryption (with TLS). In the end TLS was left as an optional flag; servers are not required to use HTTPS when serving content. However, as it happens, all of the browser makers have agreed to only support HTTP/2 when used over TLS, making it a de-facto standard.

Deployment

What do you have to do? On the client side, nothing! You're likely already running a browser that speaks HTTP/2. Thank you, auto-updating browsers.

And many servers already support HTTP/2. Facebook, Google, and Twitter all already support HTTP/2, enjoying decreased bandwidth usage and better performance, among other benefits. (Oracle, Amazon, Apple, get with it.) If you help maintain a website, consider working with the web developers to upgrade the servers to support HTTP/2.

Enabling HTTP/2 in nginx is a single entry in the server listen directive. In Apache it is 3 extra lines.
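
Roughly speaking (nginx 1.9.5+ and Apache 2.4.17+; treat these as sketches rather than complete configs):

# nginx: add the http2 flag to the listen directive
server {
    listen 443 ssl http2;
    ...
}

# Apache: load mod_http2 and advertise the protocol
LoadModule http2_module modules/mod_http2.so
Protocols h2 http/1.1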

And surprise: you're likely reading this article over HTTP/2 right now! https://sysadvent.blogspot.com is HTTP/2 enabled. Check your website, courtesy of KeyCDN. You can also install plugins to add handy indicators to Chrome and Firefox. If plugins aren’t your style you can use Chrome Developer Tools (Network - Right click on columns and show Protocol) or Firefox Developer Tools (Headers - Version). Safari and Edge will show it as well.

Is it better? You can test it for yourself at https://www.cloudflare.com/http2/

(Can you turn HTTP/1.1 off? Probably not a good idea. HTTP/1.1 traffic will likely be around for a while.)

Holiday Easter Eggs

An HTTP server that is seeing too many requests from a client can respond with a RST_STREAM frame containing error code 0xb, ENHANCE_YOUR_CALM. In the future all restaurants are Taco Bell.

There is a "magic string" that browsers send in a header to servers to indicate that they can speak HTTP/2. In 2013 this string was modified in the HTTP/2 specification to contain a reference to PRISM). http://blog.jgc.org/2015/11/the-secret-message-hidden-in-every.html contains further comments from John-Graham Cunning and Mark Nottingham.

Conclusion

HTTP/2 is a major, exciting change in the protocol we all know and love. Hope you enjoyed this quick tour of HTTP/2.

Thanks

Thank you to my editor, Tobias Brunner (@tobruzh), for his suggestions and corrections. A very large hug to Chris Webber and the organizers of SysAdvent for pulling this project together each year.

References

Ilya Grigorik’s fabulous book High Performance Browser Networking has an excellent chapter on HTTP/2.

Mark Nottingham’s "What to Expect from HTTP/2".

HTTP/2 FAQ https://http2.github.io/faq/

December 18, 2015

Day 18 - Deployments done the Delivery way

Written by: Christopher Webber (@cwebber)
Edited by: Ted Young (@jitterted)

This year has been all about Delivery for me, starting with getting https://www.chef.io deployed on stage at ChefConf using Delivery. It has been a blast moving services into Delivery and using Delivery to build new ones.

What Even is Delivery?

In the simplest of terms, Delivery is a tool for enabling continuous delivery. It has been shaped over many years of experience working with folks all across the industry in building their pipelines. For me, it is an opinionated build and deployment pipeline. Explaining why things are the way they are is a bit outside of the scope of this post. What follows is a brief overview of the way I view the world.

Phases and Stages

Delivery is made up of a set of stages: Verify, Build, Acceptance, Union, Rehearsal, Delivered. There are manual approval steps between the Verify and Build stages, and the Acceptance and Union stages. Each stage is made up of a series of phases where actual tasks are executed.

Below is a list of the stages, and the phases that they execute.

  • Verify: Before another developer reviews the code, verify it is worthy of being viewed by a human.
    • Unit: Unit test the code that you are deploying. For a cookbook, this is probably ChefSpec; for a Rails app, it might be RSpec or minitest.
    • Lint: This is a test of whether you are properly formatting your code and following best practices. For Ruby apps, you will probably want to run Rubocop.
    • Syntax: Is it parse-able? Just like we do a configtest before restarting nginx or apache, it is useful to do the same with our code.
  • Build: Now that code review is done and we have merged to master, let's build some artifacts (cookbooks, packages, images, etc.)
    • Unit: Same as before, but now on an integrated codebase (we merge the code to master during the manual approval step between Verify and Build).
    • Lint: Same as before, but now on an integrated codebase.
    • Syntax: Same as before, but now on an integrated codebase.
    • Quality: This is where you might fail a build if it doesn't have the right amount of code coverage, etc. You are looking to test that the code meets a quality standard of some sort.
    • Security: Test the code for security. In Rails, running Brakeman along with bundler-audit is a great place to start.
    • Build: Produce artifacts that we can promote through the process. This may be a cookbook, a software package, or even a system image.
  • Acceptance: Setup the artifact(s) in an environment where we can verify that they are ready to go to production. We have a manual step after this to give someone a chance to poke around and make sure things work well.
    • Provision: What this means for your environment may vary, but I usually use it to stand up the instances I am going to deploy onto and any other supporting pieces, such as ELBs, RDS Instances, Elasticache Instances, etc.
    • Deploy: In most cases, this is simply a matter of running the cookbook associated with the service.
    • Smoke: Does it work? For most web services, it is as simple as making sure you get a 200 OK from a healthcheck endpoint to prove that, yup, it started. These tests should be super lightweight to provide fast feedback in case they fail, so we don't waste time running the Functional tests.
    • Functional: This is where we ensure it functions correctly. Whether that is testing a bunch of endpoints, running selenium scripts, or pointing metasploit at the instances, you want to validate that the system is functional.
  • Union: Do the upstream services still work? After we do a pass on the service we are deploying, we go and re-run the phases in the union stage for the projects that have declared a dependency on this project.
    • All phases are the same as in Acceptance.
  • Rehearsal: Ensure that we can do the deploy one more time cleanly.
    • All phases are the same as in Acceptance.
  • Delivered: Actually build out the "production" service.
    • All phases are the same as in Acceptance.

As you may have noticed, most of the phases are executed in more than one stage, allowing us to ensure that the state of the world is good. For example, all of the phases that run in Verify also run in Build to validate that things are still good. And in Acceptance, Union, Rehearsal, and Delivered, each stage runs the same set of phases to build each environment the same way.

Ship it!

So what does this actually look like? For most services, I see it break out into three pieces:

  1. The application: This is the actual service you are going to run. It is usually the base of the repo.
  2. The deploy cookbook: A cookbook that lives under cookbooks that you run on the node on which the service is running.
  3. The build cookbook: A cookbook that lives at .delivery/build-cookbook that handles all of the moving parts that make the service go.

Most of us are familiar with the first two. The application is the actual thing. If you are a Ruby shop, it is probably a Rails or Sinatra app. If you are a Java shop, it might be a Spring app. Whatever it is, it is the actual thing you are deploying. The deploy cookbook is the configuration management code that makes the node do the thing. If you are deploying a Rails app, it probably sets up nginx, adds some users, and spins the application up using Unicorn.

The Build Cookbook

I want to focus on the third item for a bit. The build cookbook is what Delivery, using the delivery-cli, actually runs. Each phase is represented by a recipe. So there are unit, lint, syntax, provision, etc., recipes in this cookbook. There is also a recipe called default, which is run as root, and runs at the beginning of each phase. Once that finishes, the actual phase recipe is executed with non-root privileges. The build cookbook provides the directed orchestration I have always wanted: I can stand up a database, run the data import, and then, only if that succeeds, start up the app instances. In the case of omnitruck, we make sure that the instances have everything they need, like a load balancer and CDN service before we deploy the code.
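
Concretely, the build cookbook ends up as one recipe per phase, something like this (exact contents vary from project to project):

.delivery/build-cookbook/
  metadata.rb
  recipes/
    default.rb      # runs as root at the start of every phase
    unit.rb
    lint.rb
    syntax.rb
    provision.rb
    deploy.rb
    smoke.rb
    functional.rb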

The Shared Repo

Since all three pieces, the application, and build and deploy cookbooks, are all in one repo, changes to any aspect of the application can easily be found. The coolest thing is that we now tie all changes to the service to a single commit history. This means, if we make a change to the app and a corresponding config change is needed in nginx, we see it all together. It also means that all changes to the app are tracked in one place. Whether we are tracking that a new route was added to the app or that a new header was added via the load balancer, all of the changes are wrapped up neatly in a single log of commits.

Deploying Omnitruck

Chef runs a service called omnitruck that provides information about packages used by chef-client and other tools. The application follows the pattern I outlined above. You can visit the omnitruck repo and browse through the code. Here is a high level overview of what it looks like to ship omnitruck.

  1. The process starts with someone creating a change, automatically kicking off the verify stage. If it passes, we review the code and approve the change.
  2. From there it heads to build and acceptance. In the build stage, we get a set of deployable artifacts. For omnitruck, it is the deploy cookbook being published to the chef-server and the source code being neatly packaged into a tarball.
  3. The fun really begins in the acceptance stage where we start standing stuff up. We provision an ELB, a set of EC2 instances, some CDN config, and some DNS entries.
  4. Still in the acceptance stage, we next use the deploy cookbook, which comes from the chef-server, to deploy omnitruck to the EC2 instances. If the chef-client run completes successfully on the EC2 instances, we flush the cache on the CDN.
  5. Smoke and functional tests then run to ensure that we are good to go.
  6. In union, we do it all over again, except that we rerun the Union phase of each of the consumer projects, which are other projects that have declared a dependency on the service we are shipping. For example, while you can't see it in the omnitruck repo, there is a project called chef-web-ocfrontend which defines the nginx instances that support www.opscode.com. That service depends on omnitruck so we verify that it didn't get broken in this process.
  7. From there, we move on to the rehearsal stage and the delivered stage which makes the project live.

Conclusion

As I watch Delivery mature, I am amazed at how awesome the workflow has become. While Delivery the product is closed source, the delivery-cli, which handles the actual running of code, is freely available for download.

December 17, 2015

Day 17 - Grokking systemd for Fun and Profit

Written by: Tyler Langlois (@leothrix)
Edited by: Ben Cotton (@funnelfiasco)

That's right: this post is about the "s" word. We'll be looking at how to leverage systemd for great good, whether you're a wary user on a foreign system or intending to use these features extensively in your own infrastructure. I'd encourage readers to set aside any existing biases for the project and join me in exploring some of the underlying capabilities of these tools - because whether you agree with its design philosophy or not, understanding the project better can only serve to better the conversation.

Before we start throwing levers and pushing buttons, it behooves the wary sysadmin to get familiar with the logging system behind systemd: journald.

Dear Journal:

Like other systemd tools, journald is somewhat familiar yet functionally different from its traditional predecessors. Coming from a traditional /var/log/ background, some of these differences are particularly worth noting:

  • A simple tail on journal log files in /var/log/journal isn't sufficient - journald encodes stored logs in a binary format that includes supplemental metadata fields (and some "trusted" fields we'll get into later) which require use of the journalctl utility.
  • By default, journald will attempt to intelligently rotate old logs based upon disk storage availability: the daemon will use either up to 10% of available space or keep 15% available on the filesystem it logs to, whichever is smaller (incidentally, this also means logrotate is somewhat moot for journal files.) The man page for journald.conf explains disk usage logic in greater detail.
  • Log entries and the aforementioned metadata can be read with a straightforward pager and subjected to basic query logic (matching fields based on AND, OR, and so on) through journalctl.

Log metadata is stored within fields in a log entry. The following example demonstrates what a snippet of this structure looks like, taken from the logs of an sshd unit:

...snip...
MESSAGE=Server listening on 0.0.0.0 port 22.
_PID=226
_COMM=sshd
...snip...

Many of these fields are self-explanatory: the daemon emitted some text on stdout, which is the MESSAGE field. There are also two additional fields here prefixed with "_", indicating their status as "trusted" fields. This means their content has been derived from the supervising process and is outside the control of the process emitting the log in question - hence, the validity of their contents is "trusted".

Given that information, we can trust that this log entry really did originate from process ID 226, and that it arose from an invocation of the sshd command.

However, this also implies we can control journal fields! It turns out that from within daemons, journald calls can be made that pass explicit values for fields. Note that most of the time, daemons will just spit to stdout and modify MESSAGE implicitly unless some special code is written to tell journald about extra fields (I'll use the ruby journald library to create fields in the following examples). What kind of functionality can we build on top of this with some simple code?

Based upon simple use cases I've played with, user fields have been particularly handy when looking to elevate logs that need to be seen by a human. For example, the following dead-simple ruby daemon listens for journal events that have a user field of NOTIFY_SLACK=1:

#!/usr/bin/env ruby

require 'systemd/journal'
require 'slack-notifier'

webhook_url = ENV['SLACK_WEBHOOK_URL']

j = Systemd::Journal.new
notifier = Slack::Notifier.new webhook_url,
                               channel: '#bots'

j.add_filter('NOTIFY_SLACK', '1')
j.seek(:tail)

j.watch do |log|
    notifier.ping "#{log.message} (/cc @tylerjl)"
end

I run this daemon on my personal server and make calls like the following from within my personal tools when I need them to ping me on Slack (again, example in ruby):

Systemd::Journal.message(
  message: "Singularity achieved!",
  notify_slack: '1'
)

This means I can watch system-wide for logs I've tagged for elevation into chat channels. Note also that filtering on a user field, as the ruby code does here, is a similar exercise from the command line:

journalctl NOTIFY_SLACK=1

Some journalctl flags are particularly useful additions to an admin's toolbox:

  • Live-tail the system's logs: journalctl -f
  • Examine all logs from the previous boot, starting at the end: journalctl -e -b -1
  • Run a check against all journald log files to verify file integrity: journalctl --verify

The Wild World of Unit Types

The .service unit is a friendly face: intended to represent persistent services and daemons, it's pretty self-evident what happens when one invokes systemctl start sshd.service.

However, systemd units model more than persistent or oneshot daemons. You've probably heard of a couple of these:

  • timer - activates associated service units based on monotonic or realtime intervals.
  • socket - listens for data and passes it along or explicitly starts other units.
  • path - watches directories and files to activate other units based on inotify-driven events.
  • target - analogous to SysV runlevels, these permit you to create your own arbitrary targets to organize groups of other units.

Consider the path unit. This unit type can create fairly straightforward relationships between inotify events and service units. One use case for this could be a path unit that watches a directory for new backups to appear, and activates another oneshot service unit to archive newly appeared files to S3 or another remote endpoint.
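
A sketch of that backup-archiving idea (unit names and paths invented for illustration):

$ cat /etc/systemd/system/backup-archive.path
[Unit]
Description=watch for new backup files to archive

[Path]
DirectoryNotEmpty=/srv/backups/incoming
Unit=backup-archive.service

[Install]
WantedBy=multi-user.target

The matching backup-archive.service would then be a oneshot unit whose ExecStart= pushes whatever it finds in that directory to S3 and moves it out of the way.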

The timer unit, in particular, has a familiar use case: triggering events based on a schedule. Timers typically replace crontabs in systemd-centric environments when paired with oneshot services. There are general differences here, including calendar formatting and the introduction of persistent and monotonic timers. The reference under man systemd.timer is a sufficiently detailed guide.
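
As a rough sketch, a crontab line that fired a nightly report at 3am becomes a timer like this (names hypothetical), paired with a oneshot nightly-report.service and enabled with systemctl enable nightly-report.timer:

$ cat /etc/systemd/system/nightly-report.timer
[Unit]
Description=run the nightly report

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target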

If you're among the jaded rank-and-file of ops people just waiting for things to fail, one noteworthy option in this use case is OnFailure=. I've often found introspection into the success and failure of cron jobs to be fairly opaque; the default email behavior can be difficult to leverage effectively. This option enables a degree of reporting for units that may experience problems.

Consider this example oneshot unit that checks the status of a ZFS pool (if you haven't tried ZFS on Linux yet, you totally should):

$ cat /etc/systemd/system/zfs-check.service
[Unit]
Description=check result of last ZFS scrub
OnFailure=failure-report@%n.service

[Service]
Type=oneshot
ExecStart=/usr/bin/sh -c 'zpool status | grep "No known data errors"'

This service could tentatively be triggered weekly by an associated timer. Notice the OnFailure= option. That service file could look something like this:

$ cat /etc/systemd/system/failure-report@.service
[Unit]
Description=failure notification for %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/unit-notify.rb %i

The "@" in the unit name means this unit can be instantiated with different suffixes. In this case, we're invoking the unit as failure-report@%n.service from the ZFS unit, which translates %n into the unit name, zfs-check. The unit-notify.rb script can then be something as simple as:

$ cat /usr/local/bin/unit-notify.rb
#!/usr/bin/ruby

require 'systemd/journal'

Systemd::Journal.message(
  message: "Unit #{ARGV[0]} has failed!",
  notify_slack: '1'
)

Coupled with the previously mentioned journald daemon, notifications for failed units can be fairly streamlined - just add OnFailure=failure-report@%n.service and go. When running systems at greater scale, this could also be applied to calling webhooks to gather unit activity in aggregate.

Into the Belly of the Beast: dbus

There may come a time in the course of dev-ing your ops when you need more finely-grained programmatic access to systemd. In such cases, although unit files are usually available on the filesystem, more detailed information regarding system state can often be acquired through dbus.

The documentation for the systemd dbus API is fairly comprehensive and outlines all of the capabilities of this interface to systemd. In a nutshell:

  • methods can be called from the API to perform certain actions, such as reloading a unit or rebooting the host machine.
  • signals can be subscribed to in code to trigger actions - think interrupts for your init system.
  • properties represent settings or traits about units, the host system, and other systemd objects.

Amid all these sources of data, there are rules and interfaces for accessing them. One thing to bear in mind when coding against dbus APIs is that they can get a little wordy. Consider this code snippet that prints the environment for the sshd service:

from dbus import SystemBus, Interface

bus = SystemBus()
systemd = bus.get_object('org.freedesktop.systemd1',
                         '/org/freedesktop/systemd1')
manager = Interface(systemd,
                    dbus_interface='org.freedesktop.systemd1.Manager')
unit_path = manager.LoadUnit('sshd.service')
unit_proxy = bus.get_object('org.freedesktop.systemd1', str(unit_path))
unit_props = Interface(unit_proxy,
                       dbus_interface='org.freedesktop.DBus.Properties')
for env_var in unit_props.Get('org.freedesktop.systemd1.Service', 'Environment'):
  print(env_var)

There's a great deal of flexibility here, especially when you peruse the list of interfaces on the API documentation - as long as you're willing to slog through some fairly verbose setup. Here are some use cases for the dbus API that I've personally found useful:

  • Subscribing to the PropertiesChanged signal of a unit to watch for unit state changes.
  • Bringing custom BusNames online to handle special dependency management. For example, a custom service may successfully fork but may be unable to serve a dependent service until it verifies itself as healthy, in which case you can use the BusName= service option to watch for the green light before indicating a service's dependent units should proceed.
  • Obtaining a list of all units currently loaded, running, or failed on the system (see the sketch below).
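
As a taste of that last item, here's a minimal sketch building on the python-dbus setup above: ListUnits() on the Manager interface returns one tuple per unit, and the fourth field is the unit's active state, so filtering for failed units takes only a few lines:

from dbus import SystemBus, Interface

bus = SystemBus()
systemd = bus.get_object('org.freedesktop.systemd1',
                         '/org/freedesktop/systemd1')
manager = Interface(systemd,
                    dbus_interface='org.freedesktop.systemd1.Manager')

# Each entry looks like (name, description, load state, active state, ...)
for unit in manager.ListUnits():
    if unit[3] == 'failed':
        print('%s: %s' % (unit[0], unit[1]))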

Bonus Round: Ops Grab Bag!

If nothing else, I hope that you may find something here that will make your life easier when working with systemd. With that in mind, consider these tips when you're in the trenches:

  • Need to quickly override a service option that shipped with your distro's service file? Use systemctl edit foo.service to drop into an editor and create an override file that will be automatically written to /etc/systemd/system/foo.service.d/override.conf. The override file's options take precedence over the stock unit.
  • Need to express that nginx.service requires php-fpm.service to be running? Run systemctl add-requires nginx.service php-fpm.service and the appropriate symlink will be created for you in /etc/systemd/system/.
  • Need to watch the log files for a service? Live-tail a service's journal with journalctl -u foo.service -f.

Conclusion

I do hope that this has been an informative tour of some of the practical considerations to bear in mind when working with systemd. The jury is still out on whether systemd will destroy the planet or usher in a thousand-year peace between all distributions, but a little extra knowledge never hurts.

I'd like to thank Ben Cotton for generously editing this piece, and for the SysAdvent organizers being a delightful group of people to work with. May your pager be silent and holiday be downtime-free!