But Ansible does it all!!
Most people running networks have a decent fleet of Cisco IOS devices (or devices that feature an "industry-standard" *cough* ripoff *cough* CLI). Oftentimes, Ansible comes up in discussion as the framework that will solve automating them.
What does Ansible do?
Ansible is designed to express desired state, compare it against existing state and then calculate whether a change is needed. A great example is that we'll generate a config file with a template, push this to a server, and then signal the relevant service (perhaps with a restart). Cool! Easy! But what if the config file is already the same as what we generated from the template? No problem, we'll calculate that the files are the same and do nothing. No restart, no problem.. it's already good.
This technique is highly effective for things like webservers where the service supports a consistent and suitable approach (in context, of course) for loading the new config.
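As a rough sketch of that compare-then-act pattern (this is not Ansible's code; `apply_config` and the `restart_service` callback are made-up names for illustration):

```python
import os
import tempfile


def apply_config(path, rendered, restart_service):
    """Write `rendered` to `path` and restart only if the content changed.

    Returns True when a change was made, False when nothing needed doing.
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None  # a missing file counts as "different"

    if current == rendered:
        return False  # existing state already matches desired state

    # Write a temp file next to the target, then rename it into place,
    # so the service never reads a half-written config.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(rendered)
    os.replace(tmp_path, path)

    restart_service()
    return True
```

Run it twice with the same rendered text and the second call does nothing. That is the whole trick, and it works because the server side gives us a file we can swap wholesale.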
kk.. cool but what is wrong with IOS?
IOS-like CLIs are, generally speaking, centered around the idea of a big config file (conceptually, anyway.. details are a bit more complex). In order to make changes to this config file, we can either:
- Have the parser merge another (subset) config file
- Make changes directly from CLI via parser
- Install an entirely new config file and reboot to have it take effect (letting a clean parser take over)
None of these models allows for in-service wholesale replacement of the configuration. Some people will point to a config replacement tool in IOS, but it is very finicky and may cause rather unexpected effects (not to mention, it is NOT an atomic replacement). This problem is semi-resolved in IOS-XR.
Our goal, reasonable in-service config changes, requires calculating the delta and applying only the delta. As IOS-like config is only loosely structured and provides no transactional support, this requires the automator to consider the underlying config structure, the order of operations and how the presentation in the resultant config file will look (I know you didn't put those three extra spaces, but Cisco decided to put them there).
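To make the "calculate the delta" part concrete, here is a minimal sketch for one flat section of config. The function names are hypothetical, and it deliberately ignores ordering and nesting, which are exactly the parts that make the real problem hard:

```python
def normalize(line):
    """Collapse runs of whitespace so cosmetic spacing is not seen as a change."""
    return " ".join(line.split())


def config_delta(running, desired):
    """Return (to_add, to_remove) for one flat section of config lines.

    Lines are compared after whitespace normalization. Turning `to_remove`
    into actual negation commands is device specific and left to the caller.
    """
    running_norm = [normalize(line) for line in running if line.strip()]
    desired_norm = [normalize(line) for line in desired if line.strip()]
    to_add = [line for line in desired_norm if line not in running_norm]
    to_remove = [line for line in running_norm if line not in desired_norm]
    return to_add, to_remove


# The device re-spaced one of our lines; that should not count as a change.
running = ["ntp server 10.0.0.1", "ntp server    10.0.0.2", "ntp server 10.0.0.3"]
desired = ["ntp server 10.0.0.1", "ntp server 10.0.0.2", "ntp server 10.0.0.4"]
to_add, to_remove = config_delta(running, desired)
print("add:", to_add)
print("remove:", to_remove)
```

Without the normalization step, the re-spaced line would show up as both an add and a remove on every single run.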
That's not a big deal though!
In certain initial provisioning cases, I agree: this is not a major problem. We can decompose into individual steps and assume everything works. Plus, it is unlikely to result in an outage if we make a mistake.
In the production case, however, the behavior is much more interesting: if we "no ip access-list <blah>" then "ip access-list <blah>" and start adding entries, what happens?
When an ACL is attached to an interface but undefined, all traffic is permitted. Once the access-list is defined, all traffic is denied through implicit deny. That means, when you did "no ip access-list <blah>", the whole world can access everything. Then, with "ip access-list <blah>", nothing can access anything. Oops, that was the interface we were using to do management. Shit. Similar behavior exists for prefix lists and some other structures.
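A toy model of that sequence, assuming the behavior described above (an undefined ACL passes everything, a populated one ends in an implicit deny). The port-based matching and all names are invented for illustration and have nothing to do with how IOS evaluates ACLs internally:

```python
def permits(acl, packet):
    """Evaluate a packet against a toy ACL.

    `acl` is None when an interface references a name that is not defined;
    otherwise it is a list of (action, predicate) entries, first match wins.
    """
    if acl is None:
        return True  # undefined ACL: everything passes
    for action, matches in acl:
        if matches(packet):
            return action == "permit"
    return False  # nothing matched: implicit deny


def is_ssh(packet):
    return packet["dport"] == 22


def is_https(packet):
    return packet["dport"] == 443


mgmt_ssh = {"dport": 22}      # our own management session
stray_telnet = {"dport": 23}  # traffic the ACL was supposed to block

# ACL state at each point of "remove it, re-create it, add entries one by one".
timeline = [
    ("before the change", [("permit", is_ssh), ("permit", is_https)]),
    ("after 'no ip access-list <blah>'", None),
    ("after re-creating it with one entry", [("permit", is_https)]),
    ("after the last entry", [("permit", is_https), ("permit", is_ssh)]),
]

for label, acl in timeline:
    print(f"{label}: ssh={permits(acl, mgmt_ssh)} telnet={permits(acl, stray_telnet)}")
```

The two middle rows are the problem: first the stray traffic gets through, then our own SSH session is the thing being dropped.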
Ansible doesn't rationalize the denominator
The ios_config module works around this problem by expressing lines of configuration that should be present in a section. The section can be global config (default) or under some parent (for example, "ip access-list extended test"). It cannot consider the whole configuration and be sane. It cannot replace an ENTIRE configuration to reach a truly desired state.
It also will not reflect the order that config must be in. For example, consider the following access-list:
ip access-list extended test
 permit tcp any 10.0.0.0 0.0.0.255
 deny ip any 10.0.0.0 0.0.0.255
And suppose in our task, we set it up like this:
ios_config:
  lines:
    - permit tcp any 10.0.0.0 0.0.0.255
    - permit udp any 10.0.0.0 0.0.0.255
    - deny ip any 10.0.0.0 0.0.0.255
  parents: ip access-list extended test
From a simplistic point of view, we would expect to see:
ip access-list extended test
 permit tcp any 10.0.0.0 0.0.0.255
 permit udp any 10.0.0.0 0.0.0.255
 deny ip any 10.0.0.0 0.0.0.255
After running, however, we'll see:
ip access-list extended test
 permit tcp any 10.0.0.0 0.0.0.255
 deny ip any 10.0.0.0 0.0.0.255
 permit udp any 10.0.0.0 0.0.0.255
Oops! We broke it. Only the missing udp line gets sent, and IOS appends it to the end of the ACL, after the deny.
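The result above is what you get from naive line matching. A toy model of it (my own simplification, not the module's actual code; `push_missing_lines` is a made-up name):

```python
def push_missing_lines(section, desired):
    """Mimic line-by-line matching against one config section.

    Only lines that are not already present get sent, and the device appends
    whatever it receives to the end of the section (no sequence numbers used).
    Returns (resulting_section, lines_sent).
    """
    missing = [line for line in desired if line not in section]
    return section + missing, missing


running_acl = [
    "permit tcp any 10.0.0.0 0.0.0.255",
    "deny ip any 10.0.0.0 0.0.0.255",
]
desired_acl = [
    "permit tcp any 10.0.0.0 0.0.0.255",
    "permit udp any 10.0.0.0 0.0.0.255",
    "deny ip any 10.0.0.0 0.0.0.255",
]

result, sent = push_missing_lines(running_acl, desired_acl)
print("sent:", sent)
print("order matches desired:", result == desired_acl)
```

Every desired line is present afterwards, so a second run reports nothing to do, yet the udp permit sits behind the deny and never matches.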
Then it breaks and acts like we're good.
Consider a router or switch or whatever. It has the following (relevant) config:
interface Vlan20
 ip address 10.20.0.1 255.255.255.0
 no shutdown
Now, suppose we make a booboo and we try to do something like:
ios_config:
  lines:
    - ip address 10.20.0.200 255.255.255.0
    - no shutdown
  parents: interface Vlan30
For most IOS users, the problem is immediately obvious: interfaces Vlan20 and Vlan30 will be in the same subnet and IOS will barf out an error and not apply this config. Something like:
10.20.0.0 overlaps with Vlan20
However, Ansible doesn't know any better and continues onward as if the configuration change was successful. Perhaps we had added an additional task to swing over the TACACS source interface to use Vlan30. Whoops. We just lost TACACS.
Well, then I'll just not do these things you're talking about and It'll All Be Fine™
Error handling is mandatory in production automation solutions. In general, our goal is to determine if an error exists and either 1) do something about it or 2) bail out. Option 3), ignore the error, should only happen if 1) is implemented and implies "I don't _need_ to do something about it".
What is not EVER acceptable is having errors unhandled and ignored. Unfortunately for us, errors in IOS-like CLIs will spew some unstructured and inconsistently built message on the terminal. Is it an error, is it a warning, do I know?
Further, as IOS is a very closed OS, we cannot simply look to documentation or source code to determine all of the possible errors that could be spewed. We need a different approach. We need to treat all messages as errors and either 1) handle them (and potentially ignore them, but in an explicit way) or 2) bail out.
Ansible does none of this though. It handles the basic invalid command case and nothing else.
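The "treat every message as an error unless explicitly handled" rule could look something like this. It is a sketch under assumptions: `response` is whatever text the device printed between our command and the next prompt, and the allow-list entry is only an illustrative example of a message you have decided is harmless:

```python
import re


class UnhandledDeviceMessage(Exception):
    """The device said something we have no explicit rule for."""


# Messages we have looked at and explicitly decided to ignore.
# Illustrative only; a real list is built up per platform over time.
KNOWN_HARMLESS = [
    re.compile(r"^Building configuration"),
]


def check_response(command, response):
    """Raise unless every line the device printed is explicitly known."""
    for line in response.splitlines():
        text = line.strip()
        if not text:
            continue  # blank lines carry no message
        if any(pattern.search(text) for pattern in KNOWN_HARMLESS):
            continue  # explicitly handled by deciding to ignore it
        raise UnhandledDeviceMessage(f"{command!r} produced: {text}")


check_response("show running-config", "Building configuration...\n")  # passes quietly

try:
    check_response("ip address 10.20.0.200 255.255.255.0",
                   "10.20.0.0 overlaps with Vlan20")
except UnhandledDeviceMessage as error:
    print("bail out:", error)
```

Note that nothing here tries to decide whether a message is an error or a warning. Unknown output is enough reason to stop.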
Cool so Ansible can show me the messages and I can use this and that, that's good enough for me!
No, you cannot. The current ios_config module does not provide the actual terminal session data. So, you can't see these silently ignored errors.
Further, what if one of your lines has a dependency on the previous line? Well, if the earlier line fails silently, the dependent line runs anyway and causes even more problems. A workaround is to create more tasks.
Fine, so I don't care about these error conditions and I'm okay decomposing into multiple tasks. Let's do it!
Final note, Ansible will use one SSH session per task. What I mean is that, for each task, Ansible will open an SSH connection, get a terminal, set up the terminal ("ter len 0", etc.), do its thing, then disconnect. I call this behavior "kerchunking" because it is similar to keying up a radio, saying something short and mostly individually useless, and then disconnecting.
This doesn't seem like a major thing, but many bugs have come up in IOS for SSH-related issues when doing this. It can get so bad in some cases that SSH becomes completely inaccessible. I recall a specific bug that even messed up the console from SSH connections.
Solutions that don't suck
Are hard to find for IOS-like CLIs. The least shit works something like this:
- Break down services into config components
- Express the config components in terms of a data structure that maps to the service
- Perform delta on the service config components
- Push the lines of config
- If there is a message in between the command prompts, either handle it or bail out. This means that any message you, the automator, do not know about should result in a bail out.
- Have a rollback for each logical step in the procedure if bail out occurs.
- Keep a complete session log
- Do this all in one SSH session
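The push, bail out, rollback and session log steps from the list above can be sketched roughly as follows. Everything here is hypothetical scaffolding: `send` stands in for "write a command to the one open SSH session and return what came back before the next prompt", and the rollback commands are examples only:

```python
class BailOut(Exception):
    """Raised when the device prints something we did not expect."""


def run_steps(send, steps, session_log):
    """Push steps over one already-open session, rolling back on surprise.

    `steps` is a list of (commands, rollback_commands) pairs. Every command
    and its response is appended to `session_log`.
    """
    started = []  # rollback lists for every step we have begun
    try:
        for commands, rollback in steps:
            started.append(rollback)  # a half-applied step needs undoing too
            for command in commands:
                response = send(command)
                session_log.append((command, response))
                if response.strip():  # any message at all is a surprise here
                    raise BailOut(f"{command!r} produced: {response.strip()}")
    except BailOut:
        # Undo the most recent step first. Rollback output is logged but not
        # checked, which a real implementation would want to improve on.
        for rollback in reversed(started):
            for command in rollback:
                session_log.append((command, send(command)))
        raise


def fake_send(command):
    """Stand-in device that complains about the overlapping subnet."""
    if command.startswith("ip address 10.20.0.200"):
        return "10.20.0.0 overlaps with Vlan20"
    return ""


steps = [
    (["interface Vlan30", "ip address 10.20.0.200 255.255.255.0", "no shutdown"],
     ["no interface Vlan30"]),
    (["ip tacacs source-interface Vlan30"],
     ["ip tacacs source-interface Vlan20"]),
]

log = []
try:
    run_steps(fake_send, steps, log)
except BailOut as error:
    print("bailed out:", error)
for command, response in log:
    print(command, "->", response or "(no message)")
```

With the fake device above, the TACACS step never gets sent, because the overlap message stops the run and the Vlan30 step is rolled back first.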