The goal of this project is to demonstrate the following:
- Applicability of CI/CD principles to data centre networks
- The use of Arista cEOS and docker-topo to build arbitrary network topologies
- The use of Batfish to analyze network configurations and verify the expected control plane and data plane properties
- The use of Robot Framework for network testing and verification
- The use of Ansible to generate complex network configurations from simple data models
- The use of Ansible to generate diffs for both network configurations and network state pre- and post-change
To achieve the above goals, I'll use the following topology:
We start with Leaf-1, Leaf-2, Spine-1 and Spine-2 fully functional. Our CI pipeline is going go through the following sequence of actions:
- Generate configurations for the new Leaf-3 (along with the necessary configs for both of the spine switches)
- Analyse these new configurations and their resulting data plane with Batfish
- Build a lab topology with docker-topo and run a series of Robot Framework tests against it
- Generate diffs between the proposed and the current production configuration
- Wait for a manual trigger to push the changes to the production network
- Collect and compare the contents of IPv4 FIB and ARP tables before and after the change
The generated configs will have several configuration bugs injected to best illustrate the most salient points of Batfish and Robot Framework, and the following walkthrough is going to explain how those bugs are found and flagged by both tools and how to rectify them and achieve a successfull build.
All of the components of this demo are encapsulated inside Docker containers. The following Docker containers are created directly on the test machine's Docker engine:
- Gitlab server - serves as a git repository and a CI server
- Gitlab runner - receives jobs scheduled by the CI server and executes them
- Docker registry - stores large Docker images to speed up the running of the demo
- Batfish - the server component of Batfish network configuration analysis tool
In addition to the above containers, Gitlab runner will spin-up Docker container to execute scheduled jobs and this container will, in turn, spin-up cEOS containers that will simulate the network topology. Such "nested containerization" is allowed through the use of docker-in-docker.
Note: Stable internet connection is required
Clone this git repository
git clone https://github.com/networkop/arista-network-ci.git && cd arista-network-ci
Save the full absolute path to the current directory:
export ROOTPATH=$(pwd)
Build the Batfish server Docker image (this may take up to 20 minutes):
docker build -t batfish tests/batfish/
Download cEOS-lab and Network Validation tools from arista.com and copy them into a local directory
ls -q *.tar.*
cEOS-Lab.tar.xz network_validation-1.0.1.tar.gz
Build the Docker executor image that will be used by the Gitlab runner to execute jobs (this may take up to 10 minutes):
docker build -t networkci .
Find out the default Docker bridge IPv4 address of the local host:
export PRIMARY_IP=$(docker network inspect bridge --format "{{range .IPAM.Config }}{{.Gateway}}{{ end }}")
This ip will be used in a default url for Gitlab server and Docker registry
Enable insecure Docker registry at this address and restart docker daemon:
# cat <<EOF > /etc/docker/daemon.json
{
"insecure-registries" : ["$PRIMARY_IP:5000"]
}
EOF
# systemctl restart docker
Spin-up the local Gitlab server, runner, Batfish server and a private docker registry(assuming docker-compose is installed)
docker-compose up -d
Import the cEOS-lab image into the private Docker registry
docker import cEOS-Lab.tar.xz ceos:4.20.0F
docker tag ceos:4.20.0F $PRIMARY_IP:5000/ceos:4.20.0F
docker push $PRIMARY_IP:5000/ceos:4.20.0F
Import Docker executor image into the private Docker registry
docker tag networkci $PRIMARY_IP:5000/networkci
docker push $PRIMARY_IP:5000/networkci
Update .gitlab-ci.yml's PRIMARY_IP with the value of the PRIMARY_IP environment variable
sed -ri "s/(^\s*)PRIMARY_IP:.*$/\1PRIMARY_IP: ${PRIMARY_IP}/" .gitlab-ci.yml
Wait until the Gitlab server's status changes from starting
to healthy
:
watch docker ps
Register the Gitlab Runner
$ ./register-runner.sh
Running in system-mode.
Registering runner... succeeded runner=RunnerTo
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!
Login the Gitlab server URL http://localhost:9080 with username root
and password GitLabAdmin
, create a new public project called network-ci
.
Redirect the current repository to the new project's remote
git remote remove origin
git remote add origin http://$PRIMARY_IP:9080/root/network-ci.git
Make sure you're on the master branch.
git branch
* master
If not, create it:
git checkout -b master
Before we start demonstrating our CI pipeline, we need to build a simulation of the production network. To do that, first, generate the "current" configs for production (these configs are missing the Leaf-3 BGP peerings).
cd ./build
ansible-playbook -e @group_vars/current.yml -e buildenv=prod -e outpath=$ROOTPATH/prod/configs/ build.yml
Build the "production" topology (requires docker-topo to be installed)
cd ../prod
docker image pull networkop/alpine-host:latest
docker image tag networkop/alpine-host alpine-host
docker-topo --create topo.yml
Paste all the aliases provided in the output of the previous command into the current shell
alias Leaf-1='docker exec -it clos_Leaf-1 Cli'
alias Leaf-2='docker exec -it clos_Leaf-2 Cli'
Login Spine-1 and Spine-2 and verify that only two BGP peerings are configured and are in Established
state
Spine-2# sh ip bgp sum
BGP summary information for VRF default
Router identifier 1.1.1.200, local AS number 65100
Neighbor Status Codes: m - Under maintenance
Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc
10.0.255.1 4 65001 4 4 0 0 00:00:29 Estab 0 0
10.0.255.3 4 65002 4 4 0 0 00:00:29 Estab 0 0
First, have a look at the proposed data model:
cd $ROOTPATH
cat build/group_vars/intended.yml
This data model gets used to build the full network configurations for all devices, including all leaf and spine switches.
Push the current repo to the Gitlab server using root/GitLabAdmin as credentials:
git add -A .; git commit -m "new change"; git push -u origin master
Gitlab server automatically detects the CI pipeline in .gitlab-ci.yml
file and schedules to run in on our gitab runners. The pipeline is built out of 5 stages and 6 jobs, which are executed in the order specified in the stages
variable.
The first, build stage, contains two jobs that generate configs for lab and production network topologies respectively. They are saved as artifacts, which makes them available to all subsequent pipeline stages.
.generate: &generate_template
before_script:
- apk add --upgrade ansible
- pip3 install --upgrade netaddr
- mkdir -p $CI_PROJECT_DIR/outputs/configs
script:
- cd build
- ansible-playbook -e buildenv=$BUILDENV
-e ansible_python_interpreter=/usr/bin/python3
-e @group_vars/intended.yml
-e outpath=$CONF_DIR
build.yml
artifacts:
when: always
paths:
- ./outputs
generate_lab_configs:
<<: *generate_template
stage: build
variables:
BUILDENV: lab
generate_prod_configs:
<<: *generate_template
stage: build
variables:
BUILDENV: prod
The second, analysis stage runs a series of Batfish tests against the generated configurations and verifies that:
- The configs do not contain any unused or non-existent configuration objects
- All BGP peerings have been configured correctly and each leaf peers with each one of the spine switches
- The traceroute between Leaf-3 and Leaf-1/2 loopbacks is successful, traverses exactly two hops and takes two different paths(i.e satisfies the multipathing property of Clos topologies)
- In case of a failure of Spine-2 the traceroutes between leaf loopbacks still succeeds.
batfish:
stage: analysis
before_script:
- mkdir -p ./candidate/configs
- mkdir -p ./candidate-with-failure/configs
- cp ./outputs/configs/prod_* ./candidate/configs
- cp ./outputs/configs/prod_* ./candidate-with-failure/configs
- cp ./tests/batfish/node_blacklist ./candidate-with-failure/node_blacklist
script:
- ./tests/batfish/leaf-3.py --host $PRIMARY_IP --log ./outputs/batfish.txt
artifacts:
when: always
paths:
- ./outputs
The third, test stage builds a lab topology out of cEOS Docker images and runs a series of Robot Framework tests against it to confirm that:
- Leaf-1 and Leaf-2 loopbacks are learned by Leaf-3 via BGP
- The SVIs of those switches are reachable and Host-3 can reach both Host-1 and Host-2
The final Robot Framework report is uploaded to Gitlab as artifact and can be viewed from Gitlab web GUI, rendered by the built-in Github Pages.
lab_testing:
stage: test
before_script:
- dockerd --insecure-registry $LOCAL_REGISTRY &
- sleep 2
- mkdir -p $CI_PROJECT_DIR/outputs/robot
- docker image pull $LOCAL_REGISTRY/ceos:4.20.0F
- docker image tag $LOCAL_REGISTRY/ceos:4.20.0F ceos:4.20.0F
- docker image pull networkop/alpine-host:latest
- docker image tag networkop/alpine-host alpine-host
- pip3 install --upgrade git+https://github.com/networkop/arista-ceos-topo.git
- docker-topo --create ./tests/robot/topo.yml
- docker exec lab_Leaf-1 wfw Aaa
- sleep 120
script:
- cd ./tests/robot
- mkdir ./report
- validate_network.py --config test.yml --reportdir report
- cp report/* $CI_PROJECT_DIR/outputs/robot/
- ./has_failed.sh
artifacts:
when: always
paths:
- ./outputs
The fourth stage, called diff, dry-runs the new change against production network, generates configuration diffs for each device in the topology and saves them as artifacts for future review.
generate_diffs:
stage: diff
before_script:
- mkdir -p $CI_PROJECT_DIR/outputs/diffs
script:
- cd ./build
- ansible-playbook --diff --check
-e PROD_IP=$PRIMARY_IP
-e buildenv=$BUILDENV
-e outpath=$CI_PROJECT_DIR/outputs/diffs
-e confdir=$CONF_DIR
diff.yml
variables:
BUILDENV: prod
artifacts:
when: always
paths:
- ./outputs
The final job has to be triggered manually, with the assumption that the change reviewer first needs to be satisfied with the test results and generated diffs. The outputs of show ip route
and show arp
are captured before and after the change is pushed and saved as artifacts of the current pipeline run.
push_to_prod:
stage: push
script:
- cd ./build
- ansible-playbook -e PROD_IP=$PRIMARY_IP
-e buildenv=$BUILDENV
push.yml
when: manual
allow_failure: true
artifacts:
paths:
- ./build/snapshots
variables:
BUILDENV: prod
The first run of the pipeline should fail at the batfish job, which should detect a large number of issues. To view the output of batfish, we can navigate to Project -> CI/CD -> Pipelines -> Latest failed pipeline -> Batfish and either view it right there or click Browse
Job artifacts and find the batfish.txt
file.
Based on the provided output we can conclude that:
- Each configuration file contains an undefined data structure (e.g. Leaf-1's config at line 31)
- each configuration file contains an unused data structure (e.g. Leaf-1's config at line 33)
- All traceroutes from Leaf-3's loopback to Leaf-1 and Leaf-2 loopbacks have failed due to
NO_ROUTE
reason
If we browse to artifacts and examine production Leaf-1's config at lines 31 and 33 we will see the following:
This looks like a misspelt prefix list name which is used as a match in a route-map that controls which routes get redistributed into BGP. So this would also explain why the traceroutes between the loopbacks have failed as well. To correct that we can edit the routing.j2 file to make the prefix-list names match. Once done, we can re-run our pipeline
git add .; git commit -m "bug fix"; git push
The next pipeline run will also fail, however this time, the error is different:
This looks like the traceroute hasn't explored all the possible paths from one leaf to another, which can only happen if we haven't configured the BGP maximum-paths
setting. We can once again edit the routing.j2
and add the maximum-paths 2
right after the redistribution statement and re-run the pipeline again:
git add .; git commit -m "bug fix"; git push
The next run of the pipeline will fail on Robot tests. Specifically, the data plane connectivity tests that verify that Leaf-3 can reach SVIs of Leaf-1 and Leaf-2 is failing. If we browse artifacts collected for this job, we'll see a folder called robot
. Inside that folder, the log.html
file contains the outputs of Robot test run, including the snapshot of BGP RIB, captured right before the data plane tests. If we expand that test, we won't be able to see any of the remote 192.168.X.X
subnets.
If we go back to our configs, we should be able to see that since we're only redistributing loopbacks, none of the other directly connected subnets ends up in the BGP RIB. So let's fix that by allowing all directly connected networks to be redistributed:
!
route-map RMAP-CONNECTED-BGP permit 1000
!
Let's push the updates to Gitlab:
git add .; git commit -m "bug fix"; git push
Next time our pipeline would fail on the last, end-to-end reachability test that is run between Host-3 and Host-1/2.
This final test performs a ping from a directly connected host and it looks like, although Leaf-3 can reach both SVIs, Host-3 cannot. To fix that let's get a closer look at our access ports data model:
servers:
Leaf-1:
interfaces:
lab: eth3
prod: Ethernet10-11
vlan: 10
svi: 192.168.10.1/24
Leaf-2:
interfaces:
lab: eth3
prod: Ethernet10
vlan: 20
svi: 192.168.20.1/24
Leaf-3:
interfaces:
lab: eth3
prod: Ethernet10
vlan: 30
svi: 192.168.30.3/24
It looks like the IP assigned to Leaf-3's SVI is not the first IP of the subnet, which is how Leaf-1 and Leaf-2 are configured. Let's chang that and re-run our pipeline:
git add .; git commit -m "bug fix"; git push
This time all tests should complete successfully and the pipeline stops right before the change gets pushed into production. The penultimate job dry-runs the proposed changes against the production network, generates diffs and stores them as artifacts, like this example for Spine-2:
--- system:/running-config
+++ session:/ansible_1537901087-session-config
@@ -47,6 +47,9 @@
neighbor 10.0.254.3 remote-as 65002
neighbor 10.0.254.3 send-community
neighbor 10.0.254.3 maximum-routes 12000
+ neighbor 10.0.254.5 remote-as 65003
+ neighbor 10.0.254.5 send-community
+ neighbor 10.0.254.5 maximum-routes 12000
redistribute connected route-map RMAP-CONNECTED-BGP
!
end
Finally, once the change reviewer is happy with the test results, configs and diffs, he/she may chose to push the change directly into production. The final job, push_to_prod
, captures the state of IPv4 routing table and ARP tables before and after the change and generates deltas to see what routes/ARPs have been added or removed as the result. These state snapshots are stored as artifacts in outputs/snapshots
folder. For example, this is how added IPv4 routes delta is reported by the last job:
###########################
# IP routing table added: #
###########################
[
"1.1.1.1/32",
"1.1.1.200/32",
"1.1.1.100/32",
"192.168.20.0/24",
"10.0.254.0/31",
"1.1.1.2/32",
"10.0.254.2/31",
"192.168.30.0/24",
"192.168.10.0/24",
"10.0.255.0/31",
"1.1.1.3/32",
"10.0.255.2/31"
]
------------------------------------------------------
#############################
# IP routing table removed: #
#############################
[]
------------------------------------------------------