Tooling against proprietary CLIs
Proprietary CLIs are often considered the "normal" way of interacting with IP network devices. Unfortunately, these CLIs leave a lot of room for design choices that are hostile to automation.
CLI transport methods and notes
Vendors regularly shit the bed with their CLI behaviors. Some of the components worth looking at are transport, authentication, the TTY, and error handling.
Transport
Some things only do telnet? In 2017?! Meh, yes, some.
Console
You need a comm server. It's better to write everything against that than to mess with ad hoc shell scripts. The problem: you have to adjust settings carefully for the serial link to work well.
Console is the most manual, idiotic shit to write for. Treat it as such and use it for basic provisioning tasks only.
SSH
This is usually what we want. There are two basic modes of operation: PTY allocation or an "exec" mode. In PTY mode, shitty expect-like scripting will be necessary. In exec mode, you have the POSSIBILITY that return codes actually come back and mean something.
Usually, though, exec mode will have some ridiculous limitation (like being prohibited from performing configuration or anything involving a prompt). So you'll end up back in PTY land and expect scripting anyway.
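The exec-vs-PTY split shapes how you detect failures. A minimal sketch, assuming an IOS-style error format (the patterns and the function name are illustrative, not an exhaustive vendor list): trust the exit status when exec mode gives you one, and fall back to output scraping when all you have is a PTY transcript.

```python
import re

# Common IOS-style error markers; extend per platform. These patterns
# are illustrative assumptions, not a complete vendor catalog.
ERROR_PATTERNS = [
    re.compile(r"% Invalid input"),
    re.compile(r"% Incomplete command"),
    re.compile(r"% Ambiguous command"),
]

def command_failed(output, exit_status=None):
    """Decide whether a command failed.

    exit_status: integer from an SSH exec channel, or None when we
    only have a PTY transcript to work with.
    """
    if exit_status is not None:
        # Exec mode: trust the return code when the platform provides one.
        return exit_status != 0
    # PTY mode: no return code, so fall back to scraping the output.
    return any(p.search(output) for p in ERROR_PATTERNS)
```

The fallback branch is exactly the brittle part discussed under Error Handling below: it only catches errors whose shape you already know about.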
Telnet
It's 2017 and telnet is still around in places. Expect-style scripting, 100%.
Authentication
You will need to store passwords. Figure this out now. Devices will not consistently support any key-based auth. Cisco will do X.509 certificate auth, others will have a hack for SSH signature auth.
One approach that would be useful is on-demand accounts.
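At minimum, keep credentials out of your scripts. A sketch, assuming environment-provided credentials (the `NET_USER`/`NET_PASS` variable names are hypothetical; a real deployment would pull from a vault or an on-demand account system):

```python
import os

def get_device_credentials(env=os.environ):
    """Fetch CLI credentials from the environment.

    NET_USER / NET_PASS are hypothetical variable names standing in
    for whatever secret store you actually use.
    """
    user = env.get("NET_USER")
    password = env.get("NET_PASS")
    if not user or not password:
        raise RuntimeError("credentials not provisioned")
    return user, password
```

Failing loudly when credentials are missing beats silently falling back to some default account.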
Because of stupidity, some vendors/appliances do weird shit with their authentication sequence. For example, Packeteer PacketShapers don't even have the concept of a proper username/password pair for local auth; they rely on a password alone.
Another example: Cisco APs (I think) will go through the whole SSH password auth exchange, only for it to be ignored so you can be prompted for credentials again on the PTY. More expect work.
Error Handling
Being slick with your scripting is cool, but if you're not doing error handling, you will eventually make a gigantic mess.
In expect-like scripting, we have to consider what an error looks like. Usually, we'll see something like a command + some message and then the prompt. Example:
switch#show lg
^
% Invalid input detected at '^' marker.
switch#
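The transcript above can be matched mechanically. A minimal sketch, assuming IOS-style output (the regex covers only this one error shape):

```python
import re

# The '%' prefix plus the caret marker is one recognizable IOS error shape.
IOS_INVALID_INPUT = re.compile(r"^%\s+Invalid input", re.MULTILINE)

# Canned transcript matching the example above.
TRANSCRIPT = """\
switch#show lg
                ^
% Invalid input detected at '^' marker.
switch#"""

def has_invalid_input(transcript):
    """True if the captured output contains an IOS invalid-input error."""
    return bool(IOS_INVALID_INPUT.search(transcript))
```

This handles exactly one error condition, which is the point the next paragraphs make: it doesn't generalize.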
Cool, so we found one error condition! We're good, we can note that and go on. Of course not :)
Think about configuration commands. Sometimes they will have errors or informational messages. Example:
switch(config-if)#ip address 10.30.0.20 255.255.255.0
10.30.0.0 overlaps with Vlan30
switch(config-if)#
Erm, is this an error or an informational message? Is there anything immediately obvious to differentiate the two? No, there isn't. And, in fact, this is an error!
So then, it might be easy to say "Hey, all messages between config commands are errors!". That would be nice.
switch(config-if)#ip address 10.40.0.0 255.255.255.254
% Warning: use /31 mask on non point-to-point interface cautiously
switch(config-if)#
So, wait, we can have errors, but no consistent mechanism to detect them?! Welcome to expect-like scripting.
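Given that inconsistency, the safest posture is fail-closed: maintain a whitelist of messages you have explicitly decided are benign, and treat anything unrecognized as an error. A sketch (the whitelist pattern is just the /31 warning from the example above, not a real catalog):

```python
import re

# Messages explicitly decided to be safe to ignore. Anything NOT on
# this list is treated as an error -- fail closed.
KNOWN_BENIGN = [
    re.compile(r"% Warning: use /31 mask"),
]

def classify_config_output(lines):
    """Return ('ok'|'error', offending_line_or_None) for config output."""
    for line in lines:
        line = line.strip()
        if not line or line.endswith("#"):
            continue  # blank line or a prompt: expected
        if any(p.search(line) for p in KNOWN_BENIGN):
            continue  # explicitly whitelisted message
        return ("error", line)  # unknown message: fail closed
    return ("ok", None)
```

The "10.30.0.0 overlaps with Vlan30" message from earlier would correctly come back as an error here, because nobody whitelisted it.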
But hey, my script worked! That's good enough!
No, it really isn't. This is because error conditions can happen _at any time_. What happens if you're modifying an access-list and violate a TCAM constraint (e.g. the same match parameters between ACEs)? Accidental, totally unforeseen, I've never seen this message before. Stranger danger.
Now, you ignore the message because your shit certainly works! And you attach the ACL to the interface, because the vendor didn't properly check this constraint at attachment time (and, even with the error message, it went ahead and built the ACL in software!). This kind of issue usually isn't known outside of environments with heavy automation, but it is exactly the kind of issue that requires proper error handling. If we had caught the error when it was emitted at ACL build time, we could have bailed out!
Instead, your good enough script went and attached a pile of shit and it sprayed all over the TCAM. Weird shit happens and now you're on OOB trying to figure out how to unfuck things.
Timeouts
Another thing to remember is that the connection to the device might fail, or your script might not do what it was supposed to. You have to time-bound the script somehow so it can "give up". The timing itself can get fucked up, too. I've seen situations where even directly connected devices have timed out. Because of the way error handling was set up, the timeout wasn't handled and the error surfaced at the next step in the script instead.
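A minimal sketch of a deadline-bounded read loop, where `read_chunk` is any callable returning the next piece of output (a hypothetical stand-in for a real transport's `recv`), and the prompt regex is an IOS-flavored assumption:

```python
import re
import time

# Assumed prompt shape: newline, hostname characters, then # or >.
PROMPT_RE = re.compile(r"[\r\n][\w\-.]+[#>]\s*$")

def read_until_prompt(read_chunk, timeout=30.0):
    """Accumulate output until a prompt appears or the deadline passes.

    Raises TimeoutError at the point of failure, rather than letting
    the error surface at some later step in the script.
    """
    deadline = time.monotonic() + timeout
    buf = ""
    while time.monotonic() < deadline:
        buf += read_chunk()  # in real life, this call blocks briefly
        if PROMPT_RE.search(buf):
            return buf
    raise TimeoutError("no prompt within %.1fs; giving up" % timeout)
```

The key design choice: the timeout raises where it happens, so the job runner can attribute the failure to the right step.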
Handling timeouts has to be an imperative for automation designers, especially with job runners. Otherwise, some poor pleb has to hop in and unfuck everything.
Rollback and Testing
If you can't test your change's effects, then you're missing the point of automation. If you can't roll back automatically, you're gonna need to account for that in labor.
Rollback in expect-like scripting requires looking at stages in the process. For example, ACL replacement has two core phases:
1) Creation of the new ACL
2) Attachment of the new ACL
Let's say that this ACL will be attached to interfaces. Each attachment is a potential error condition. Also remember that the new ACL might not have been created correctly in step 1).
How should we address this?
Doing what makes sense
Practically, if we see an error during ACL config, we know we're in the ACL creation phase. Let's simply bail out and nuke the new ACL. Done! In order to test this, we need to get the ACL from the device and ensure it is correct. Do NOT use the config to validate; use operational outputs (the ACL might be accepted in config but not by the software that actually handles it).
In the case of an attachment error, the ACL might be broken, the interface might have some constraint, etc. We need to REVERT the changes. Make this easy on yourself. In a purely incremental config model, we need to put the old ACL back on and make sure that we don't nuke the old ACL until we're completely finished with attachment and TESTING.
Eventually, we end up at a point of no return. Once we're there, we're going to nuke things and our script can't and won't roll back. If rollback is needed, it will take a separate process (or a meatbag) to achieve.
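The whole staged flow can be sketched as follows. The `device` object here is hypothetical (create_acl / verify_acl / attach / delete_acl are invented stand-ins for whatever transport you use), but the rollback points and the point of no return map directly to the phases above:

```python
class ChangeAborted(Exception):
    """Raised when a change fails and has been rolled back."""

def replace_acl(device, old_acl, new_acl, interfaces):
    """Two-phase ACL replacement with explicit rollback points."""
    # Phase 1: create the new ACL. On any error, nuke it and bail.
    if not device.create_acl(new_acl):
        device.delete_acl(new_acl)
        raise ChangeAborted("ACL creation failed")
    # Validate from operational output, not from the config.
    if not device.verify_acl(new_acl):
        device.delete_acl(new_acl)
        raise ChangeAborted("new ACL failed operational verification")

    # Phase 2: attach. On failure, revert every interface we touched.
    attached = []
    for intf in interfaces:
        if device.attach(intf, new_acl):
            attached.append(intf)
        else:
            for done in attached:
                device.attach(done, old_acl)  # put the old ACL back
            device.delete_acl(new_acl)
            raise ChangeAborted("attach failed on %s" % intf)

    # Point of no return: only now is the old ACL safe to remove.
    device.delete_acl(old_acl)
```

Note that the old ACL is only deleted after every attach has succeeded, which is what makes the revert path possible at all.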
However, for each change step, one should be able to VERIFY that the actual operational state matches the desired state. This should be a core network engineering practice, but it usually isn't, because of laziness and incompetence.
Tools
Rough around the edges, but some of this shit works okay.
RANCID
RANCID will blast commands with minimal to no error checking. But if that's all you need, use it instead of writing your own stupid scripts.
Ansible
If your config is truly a file and replaceable, this is great. Using this tool for incremental config changes is stupid. You're making a big mistake and it will crash in flames sooner or later.
Netmiko
When you actually want to implement proper error handling, Netmiko is an easy way to go. Unlike expect, you can actually capture entire outputs, do shit, compare and branch.
Under the hood, it's Paramiko, but Netmiko solves some of the initial connection problems for you.
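A sketch of the capture-compare-branch pattern as a checked-send wrapper. The only API assumed is Netmiko's `send_command()`; the error regex is an illustrative starting point, not a complete per-platform list:

```python
import re

# Illustrative error heuristics; a real deployment would maintain a
# per-platform catalog.
ERROR_RE = re.compile(r"^%|overlaps with", re.MULTILINE)

def send_checked(conn, command):
    """Send a command and raise if the output looks like an error.

    'conn' can be a real Netmiko connection or anything else that
    exposes send_command(); that is the only API assumed here.
    """
    output = conn.send_command(command)
    if ERROR_RE.search(output):
        raise RuntimeError("device rejected %r: %s" % (command, output.strip()))
    return output
```

Because the whole output is captured, the caller can go further: diff it against expected state, branch on it, or feed it to the classifier logic from the Error Handling section.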
Straight SSH/Telnet
You're gonna reimplement expect-like behavior. Not rocket science, but very boring.
You can't do things like set the prompt line to some known value on connection (the way you can with a bash shell via PS1). So, you'll need to:
1) Properly discard the banner bullshit.
2) Find the prompt token (like # or >).
3) Remember that the prompt token might show up in other output, too.
4) When you do get a prompt, it won't be followed by a newline, so plan on some nice regexp-based way to sort this out.
5) And finally, time out when it makes sense.
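Points 2) through 4) boil down to one anchored regex: the buffer counts as "at a prompt" only when a hostname-plus-token sequence sits at the very end, with no trailing newline. A sketch; the hostname character set is an assumption to tune per platform:

```python
import re

# Prompt ends the buffer with NO trailing newline: hostname chars,
# then # or >, anchored to the end. The character class is a guess;
# tune it per platform.
PROMPT_RE = re.compile(r"(?:^|[\r\n])([\w\-.()/:]+[#>])\s*\Z")

def find_prompt(buffer):
    """Return the prompt string if the buffer currently ends at one."""
    m = PROMPT_RE.search(buffer)
    return m.group(1) if m else None
```

Anchoring to the end of the buffer is what handles point 3): a # or > in the middle of command output never matches.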
I recommend looking at how RANCID does this prompt detection for your given platform.