
This is some introductory material for future DevOps and SRE engineers to get familiar with core networking concepts. Understanding networking is crucial for managing and troubleshooting infrastructure effectively.
Although this guide is not exhaustive, it covers essential topics that every DevOps engineer and SRE should know, especially how to work with networking using CLIs and common tools.
My personal view is that the DevOps or SRE role leans more towards Infrastructure as Code and automation, observability, CI/CD, and core networking than towards typical SysOps with some software engineering skills.
The OSI (Open Systems Interconnection) model is a conceptual framework used to understand and implement network protocols in seven layers. Each layer serves a specific function and communicates with the layers directly above and below it:

1. Physical – transmission of raw bits over a medium (cables, radio)
2. Data Link – framing and node-to-node delivery (Ethernet, MAC addresses)
3. Network – logical addressing and routing (IP)
4. Transport – end-to-end delivery, reliability, ports (TCP, UDP)
5. Session – establishing and managing sessions
6. Presentation – data formats, encryption, compression (e.g. TLS)
7. Application – protocols that applications use (HTTP, DNS, SSH)
But you need to remember that the OSI model is a theoretical framework, and real-world networking protocols often do not fit neatly into these layers.
Every connection between devices on a network uses IP addresses and ports to identify the source and destination of the data being transmitted.
You can check which sockets are currently open (listening or established) on your machine using the following commands:
# On Linux
netstat -tuln
# On macOS and Windows
netstat -an
Proto  Local Address  Foreign Address  State
TCP    0.0.0.0:22     0.0.0.0:0        LISTENING
TCP    0.0.0.0:443    0.0.0.0:0        LISTENING
TCP    0.0.0.0:1433   0.0.0.0:0        LISTENING
In this case, you can see that there are three services listening on ports 22 (SSH), 443 (HTTPS), and 1433 (MSSQL, here running inside a container).
TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are two core protocols used for transmitting data over the internet:

- TCP is connection-oriented: it establishes a connection (three-way handshake), guarantees delivery and ordering, and retransmits lost packets. Used by HTTP(S), SSH, and databases.
- UDP is connectionless: no handshake, no delivery or ordering guarantees, but lower overhead and latency. Used by DNS, DHCP, streaming, and VoIP.
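You can feel the difference with netcat; this is a quick sketch assuming a common nc variant that supports the -z (scan) and -u (UDP) flags:

# TCP: -z performs a real three-way handshake, so success means the port is open
nc -zv google.com 443
# UDP: there is no handshake, so "success" only means no ICMP error came back
nc -zvu 8.8.8.8 53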
To check the reachability of a host on an IP network, you can use the ping command. It sends ICMP Echo Request messages to the target host and waits for Echo Reply messages.
ping google.com
DNS (Domain Name System) translates human-readable domain names (like google.com) into IP addresses that computers use to identify each other on the network.
To look up DNS records, you can use the nslookup or dig commands:
# Using nslookup
nslookup google.com
# Using dig
dig google.com
To trace the path that packets take from your machine to a destination host, you can use the traceroute command (or tracert on Windows):
# On Linux and macOS
traceroute google.com
# On Windows
tracert google.com
To check whether a specific port on a remote host is open and reachable, you can use the telnet or nc (netcat) commands:
# Using telnet
telnet google.com 80
# Using nc
nc -zv google.com 80
To check what network interfaces are available on your machine, you can use the following commands:
# On Linux
ip addr show
# On Windows
ipconfig /all
To check the routing table, you can use:
# On Linux
ip route show
# On Windows
route print
IP addresses are unique identifiers assigned to devices on a network. They allow devices to communicate with each other and are essential for routing traffic across the internet.
The basic structure of an IP address consists of four octets (for IPv4) separated by periods, such as 192.168.1.102 or 172.17.0.2. Each octet can range from 0 to 255, providing a total of approximately 4.3 billion unique addresses in IPv4.
To check your current IP address, you can use the commands mentioned earlier for checking network interfaces.
$ ip addr show
(...)
36: eth0@if37: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
And one very important concept to understand is subnetting. Subnetting is the practice of dividing a larger network into smaller, more manageable sub-networks (subnets). This helps improve network performance, security, and organization.
The /16 suffix of our IP address is CIDR notation (Classless Inter-Domain Routing) and indicates the subnet mask.
So our address is 172.17.0.2/16, which means that the first 16 bits (the first two octets) identify the network portion of the address, and the remaining bits are used for host addresses within that network.
So:

- Our address: 172.17.0.2
- Network range: 172.17.0.0 to 172.17.255.255

And you lose some addresses for network and broadcast addresses:

- Network address: 172.17.0.0
- Broadcast address: 172.17.255.255

The most common subnet sizes are:
| CIDR Notation | Subnet Mask | Usable Hosts | Description |
|---|---|---|---|
| /24 | 255.255.255.0 | 254 | Class C subnet |
| /16 | 255.255.0.0 | 65,534 | Class B subnet |
| /8 | 255.0.0.0 | 16,777,214 | Class A subnet |
To calculate subnet sizes and ranges, you can use online tools like Subnet Calculator or command-line tools like ipcalc:
ipcalc 172.17.0.2/16
But of course you can do it manually as well, by understanding the binary math, e.g. to calculate the number of usable IP addresses in a subnet given a CIDR like /27:

- 32 − 27 = 5 host bits
- 2^5 = 32 total addresses
- 32 − 2 = 30 usable addresses (the network and broadcast addresses are reserved)

In short, the table above can be extended to:
| CIDR | Subnet Mask | Usable Hosts | Typical Use Case |
|---|---|---|---|
| /8 | 255.0.0.0 | 16,777,214 | Very large networks (e.g., 10.0.0.0/8) |
| /12 | 255.240.0.0 | 1,048,574 | Large private networks (e.g., 10.0.0.0/12) |
| /16 | 255.255.0.0 | 65,534 | Medium-sized networks (e.g., 172.16.0.0/16) |
| /20 | 255.255.240.0 | 4,094 | Medium-sized subnets, cloud VPCs |
| /21 | 255.255.248.0 | 2,046 | Cloud subnets, branch offices |
| /22 | 255.255.252.0 | 1,022 | Cloud subnets, small data centers |
| /23 | 255.255.254.0 | 510 | Two contiguous /24s, small LANs |
| /25 | 255.255.255.128 | 126 | Half of a /24, small segments |
| /26 | 255.255.255.192 | 62 | Small VLANs, point-to-point links |
| /27 | 255.255.255.224 | 30 | Very small VLANs, DMZs, device groups |
| /28 | 255.255.255.240 | 14 | Device clusters, point-to-point, labs |
| /29 | 255.255.255.248 | 6 | WAN links, router interconnects |
| /30 | 255.255.255.252 | 2 | Point-to-point links (router-router) |
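If you prefer to double-check these numbers, Python's standard ipaddress module can do the math for you (a quick sketch):

$ python3 -c "import ipaddress; n = ipaddress.ip_network('10.0.0.0/27'); print(n.num_addresses, 'total,', n.num_addresses - 2, 'usable')"
32 total, 30 usable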
So as you probably already suspect, we have two types of IP addresses, private and public. The private ranges are:

- 192.168.0.0 to 192.168.255.255 (192.168.0.0/16)
- 172.16.0.0 to 172.31.255.255 (172.16.0.0/12; e.g. Docker uses 172.17.0.0/16 for its containers)
- 10.0.0.0 to 10.255.255.255 (10.0.0.0/8; large organizations often use these)

To find your public IP address, you can for example use curl to query an external service:
$ curl ifconfig.me
185.21.11.177 # Obviously fake IP
NAT (Network Address Translation) is a networking technique used to modify network address information in the IP header of packets while they are in transit across a router or firewall. Its primary purpose is to allow multiple devices on a private network to share a single public IP address when accessing external networks, such as the internet.
With NAT, devices inside a local network use private IP addresses (like 192.168.x.x or 10.x.x.x) that are not routable on the public internet. When these devices communicate with external servers, the NAT device (usually a router) translates their private IP addresses to its own public IP address. It keeps track of these connections so that when responses come back, it knows which internal device should receive each response.
NAT improves security by hiding internal network structure and conserves public IP addresses, which are limited in number. There are several types of NAT, including static NAT (one-to-one mapping), dynamic NAT (many-to-many mapping), and the most common, PAT (Port Address Translation, also known as "NAT overload"), which allows many devices to share a single public IP using different port numbers.
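For illustration, on a Linux box acting as a router, PAT boils down to enabling forwarding plus a single iptables rule; a minimal sketch, assuming eth0 is the interface facing the internet:

# Enable IPv4 forwarding so the box routes packets between interfaces
sudo sysctl -w net.ipv4.ip_forward=1
# PAT ("NAT overload"): rewrite the source address of packets leaving eth0
# to the router's own public IP, tracking connections for the replies
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE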
Taking all of this into account, here is an ASCII diagram to illustrate practical subnetting:
            [ Internet ]
                 |
         +---------------+
         |  Main Router  |
         |  192.168.1.1  |
         +---------------+
                 |
          +------+-------------------+
          |                          |
[192.168.1.x devices]     +-------------------+
(PCs, phones, etc.)       |  2nd Router/VLAN  |
 192.168.1.2-254          | WAN: 192.168.1.2  |
                          |  LAN: 10.0.0.1    |
                          +-------------------+
                                    |
                      +--------------------------+
                      |     10.0.0.x devices     |
                      |   (IoT, guests, etc.)    |
                      |      10.0.0.2-254        |
                      +--------------------------+
Or a more complex scenario:
                                      [ Internet ]
                                           |
                                +---------------------+
                                |     Main Router     |
                                |   (Core Firewall)   |
                                +---------------------+
                                           |
         +------------------------+-------+--------------+-------------------------+
         |                        |                       |                         |
+-------------------+  +---------------------+  +---------------------+  +---------------------+
|   Main Network    |  |     Web Servers     |  |  Apps & Containers  |  |      Databases      |
|  192.168.1.0/27   |  |     10.0.1.0/24     |  |     10.2.0.0/16     |  |     10.0.3.0/27     |
| (e.g. .1 = router |  | (e.g. .1 = gateway) |  | (e.g. .1 = gateway) |  | (e.g. .1 = gateway) |
|  .2 = notebook)   |  |  .2-.254 = servers  |  |  .2-.65534 = apps   |  |  .2-.30 = db hosts  |
+-------------------+  +---------------------+  +---------------------+  +---------------------+
Even with such a segmented network architecture, there are still challenges to address, and you will have occasional connectivity issues between different subnets, so proper routing rules and firewall policies must be implemented to allow or restrict traffic as needed.
For example, there could be a scenario where you are able to reach some networks but not others, and you may have to add temporary routes using:
ip route add 10.0.1.0/24 via 192.168.1.1
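A route added this way is not persistent and disappears after a reboot; you can also remove it explicitly:

ip route del 10.0.1.0/24 via 192.168.1.1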
Remember that you can always check your routing table using: ip route show.
DNS as mentioned before is used to resolve domain names to IP addresses. In a complex network setup with multiple subnets, you might have your own internal DNS server to manage local domain names and their corresponding IP addresses.
As already mentioned, you can use nslookup or dig to query DNS records:
$ nslookup morele.net
Server: 192.168.127.1
Address: 192.168.127.1#53
Non-authoritative answer:
Name: morele.net
Address: 104.18.10.64
Name: morele.net
Address: 104.18.11.64
$ dig morele.net
; <<>> DiG 9.18.41 <<>> morele.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45397
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;morele.net. IN A
;; ANSWER SECTION:
morele.net. 0 IN A 104.18.10.64
morele.net. 0 IN A 104.18.11.64
dig offers more detailed output than nslookup, including sections for the question, answer, authority, and additional information.
And as you can notice, both commands return two IP addresses for the domain morele.net, indicating that it uses multiple A records for load balancing or redundancy.
The A stands for Address record, which maps a domain name to an IPv4 address. There are also other types of DNS records, such as:

- AAAA – maps a domain name to an IPv6 address
- CNAME – an alias pointing one domain name to another
- MX – mail exchange servers for the domain
- TXT – arbitrary text, often used for SPF, DKIM, and domain verification
- NS – the authoritative name servers for the domain
- SOA – start of authority, zone metadata
- PTR – reverse lookups (IP address to name)
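You can ask dig for a specific record type explicitly, for example:

dig google.com MX +short
dig google.com TXT +short
dig google.com AAAA +short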
If you dig, for example, GitHub, you will notice that it uses CNAME (Canonical Name) records to point to other domain names:
$ dig www.github.com CNAME
;; ANSWER SECTION:
www.github.com. 0 IN CNAME github.com.
The DNS system uses caching to improve performance and reduce latency. When a DNS resolver looks up a domain name, it stores the result in its cache for a specified period, known as the TTL (Time to Live).
So sometimes when you change DNS records, it may take a while for the changes to propagate due to caching. You cannot force remote resolvers to refresh, but you can at least flush your local DNS cache:
# Windows
ipconfig /flushdns
# Linux (systemd-resolved)
sudo systemd-resolve --flush-caches
# ...or on newer systems:
sudo resolvectl flush-caches
Here is an example of a simple DNS configuration using BIND (Berkeley Internet Name Domain) for a local network:
zone "example.local" {
type master;
file "/etc/bind/zones/db.example.local";
};
And the corresponding zone file (db.example.local):
$TTL 604800
@       IN      SOA     ns1.example.local. admin.example.local. (
                        2024091301 ; Serial
                        604800     ; Refresh
                        86400      ; Retry
                        2419200    ; Expire
                        604800 )   ; Negative Cache TTL
;
@       IN      NS      ns1.example.local.
ns1     IN      A       192.168.1.10
www     IN      A       192.168.1.20
api     IN      A       192.168.1.30
But remember that DNS configuration can vary significantly based on the DNS server software being used and the specific requirements of your network; with any luck, the DevOps engineer is not going to be responsible for managing DNS servers as well.
For home-lab purposes, you can always use just a simple hosts (or /etc/hosts on Linux) file to map hostnames to IP addresses without setting up a full DNS server:
192.168.1.10 ns1.example.local
192.168.1.20 www.example.local
192.168.1.30 api.example.local
192.168.1.40 example.local
127.0.0.1 k3d.local
Or you can use lightweight DNS servers like dnsmasq or Pi-hole to manage DNS for your home-lab network. These tools can provide DNS resolution, DHCP services, and ad-blocking capabilities in a simple and efficient manner.
For example, here is a config snippet for dnsmasq:
# /etc/dnsmasq.conf
address=/example.local/192.168.1.40
address=/ns1.example.local/192.168.1.10
address=/www.example.local/192.168.1.20
address=/api.example.local/192.168.1.30
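You can then test the resolution against dnsmasq directly; the server address below is an assumption, use whatever IP dnsmasq actually listens on:

dig @192.168.1.1 www.example.local +short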
When configuring networking for DevOps and SRE tasks, it's crucial to consider security aspects to protect your infrastructure and data:
- Even if there is a dedicated SecOps team, you need to know how to enable a firewall and allow or deny certain traffic.
- Use firewalls (iptables, ufw, or cloud provider firewalls) to control incoming and outgoing traffic based on predefined security rules. Only allow necessary ports and protocols.

While Alpine does not come with a firewall enabled by default, you can install and configure iptables or use nftables for more advanced firewall management. The same applies to most container images, so you need to be aware of that.
But usually Alpine is used only in containers, while hosts run, for example, Ubuntu Server or CentOS.
Taking Ubuntu as an example, we can use ufw (Uncomplicated Firewall) to manage firewall rules easily:
# Check ufw status
sudo ufw status
# Enable ufw
sudo ufw enable
# Default deny incoming traffic
sudo ufw default deny incoming
# Default allow outgoing traffic
sudo ufw default allow outgoing
# Allow SSH (port 22)
sudo ufw allow 22/tcp
# Allow HTTP (port 80) and HTTPS (port 443)
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
# Allow access to MSSQL port 1433 from a specific address only
sudo ufw allow from 10.0.32.3 to any port 1433
# Allow access to PostgreSQL port 5432 from a specific subnet
sudo ufw allow from 10.0.1.0/24 to any port 5432
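Later on, you can review the rules and remove the ones you no longer need:

# List rules with numbers so they can be referenced
sudo ufw status numbered
# Delete a rule by its number (e.g. rule 3)
sudo ufw delete 3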
These are only the core concepts that you must be aware of; security is a vast topic and requires continuous learning and adaptation to new threats and vulnerabilities. Always stay updated with the latest security best practices and guidelines relevant to your specific environment and technologies in use.
This is pretty much one of the most important parts of networking for DevOps engineers and SREs: the High Availability (HA) and Load Balancing (LB) concepts, where we want to make sure that our services are always available and can handle high traffic loads without downtime.
First of all, to check how much traffic is coming to our services, we can use monitoring tools like Prometheus and Grafana to visualize network traffic and performance metrics.
Then we can use load balancers to distribute incoming traffic across multiple servers or instances. This helps prevent any single server from becoming overwhelmed with requests, ensuring better performance and reliability.
To check the 'connection load', we can use netstat to count the number of active connections to our service:
$ netstat -an | grep :80 | grep ESTABLISHED | wc -l
356123 # Number of established connections to port 80
In this case, our web server is handling 356,123 active connections, which in a real-world scenario would be a lot; not great, not terrible.
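On modern Linux systems, ss (from iproute2) is a faster alternative for the same check; the -H flag suppresses the header line so the count stays accurate:

ss -Htn state established '( sport = :80 )' | wc -l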
So to implement load balancing, we can use tools like HAProxy, Nginx, Traefik, or cloud-based load balancers provided by AWS, Azure, or GCP.
On-premises, Nginx is one of the most popular choices for load balancing HTTP and HTTPS traffic:
# /etc/nginx/sites-available/load_balancer.conf
upstream backend {
    server 10.0.1.80:8080;
    server 10.0.1.81:8081;
    server 10.0.1.81:8082;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Another quite popular choice is, for example, Traefik:
# traefik.yml
entryPoints:
  web:
    address: ":80"

api:
  dashboard: true

providers:
  file:
    filename: ./dynamic.yml

log:
  level: INFO
# dynamic.yml
http:
  routers:
    my-router:
      rule: "Host(`internal.local`)"
      service: my-service
      entryPoints:
        - web
  services:
    my-service:
      loadBalancer:
        servers:
          - url: "http://10.0.1.80:8080"
          - url: "http://10.0.1.81:8081"
          - url: "http://10.0.1.81:8082"
Going back to the Nginx example, the configuration still has to be enabled:

# Enable the load balancer configuration
sudo ln -s /etc/nginx/sites-available/load_balancer.conf /etc/nginx/sites-enabled/
# Test Nginx configuration
sudo nginx -t
# Reload Nginx to apply changes
sudo systemctl reload nginx
In the upstream block we define multiple backend servers, and Nginx distributes incoming requests to them using the default round-robin method, but it can be changed to other methods as well (e.g. weighted round-robin, or least connections, which is often preferred).
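For example, switching the earlier upstream block to least connections with weights could look like this (a sketch, not tied to any particular backend):

upstream backend {
    least_conn;                       # pick the server with the fewest active connections
    server 10.0.1.80:8080 weight=2;   # weights still apply on top of the method
    server 10.0.1.81:8081;
    server 10.0.1.81:8082;
}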
Please note that in a real production scenario you will typically also add health checks and SSL termination:
server {
    listen 443 ssl;
    server_name your.corporate.domain;

    ssl_certificate /etc/nginx/certs/corporate.crt;
    ssl_certificate_key /etc/nginx/certs/corporate.key;
    # If you have a CA bundle:
    ssl_trusted_certificate /etc/nginx/certs/ca_bundle.crt;

    location / {
        proxy_pass http://your_backend;
    }
}
Kubernetes and Docker Swarm both have their own built-in load balancing mechanisms, but they can also integrate with external load balancers.
In case of Kubernetes, it has a built-in service type called LoadBalancer that can automatically provision an external load balancer (if supported by the cloud provider) to distribute traffic to pods.
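A minimal sketch of such a Service, assuming pods labeled app: web that listen on port 8080:

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web           # route traffic to pods carrying this label
  ports:
    - port: 80         # port exposed by the provisioned load balancer
      targetPort: 8080 # port the pods actually listen on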
There is also a HPA (Horizontal Pod Autoscaler) that can automatically scale the number of pod replicas based on CPU utilization or other metrics, helping to handle increased traffic loads.
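A basic HPA targeting a hypothetical web Deployment could look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%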
Docker Swarm also has built-in load balancing capabilities. When you create a service in Docker Swarm, it automatically distributes incoming requests across the available replicas of that service:
services:
  web:
    image: nginx:latest
    ports:
      - "80:80"
    deploy:
      replicas: 3
      update_config:
        parallelism: 2
        delay: 10s
      restart_policy:
        condition: on-failure
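Assuming the file above is saved as docker-compose.yml and swarm mode is initialized, you deploy and inspect it like this (the stack name "web_stack" is arbitrary):

# Initialize swarm mode once (if not already done)
docker swarm init
# Deploy the stack
docker stack deploy -c docker-compose.yml web_stack
# Check the service and its replicas
docker service ls
docker service ps web_stack_web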
For more advanced setups, you can also consider techniques such as DNS-based load balancing, anycast, or service meshes (e.g. Istio, Linkerd).
Containers use virtual networking to allow communication between containers and the host system. Docker, for example, creates a default bridge network that allows containers to communicate with each other using their IP addresses.
Typically, Docker assigns IP addresses from the 172.17.0.0/16 subnet to containers on this bridge network.
While containers can communicate using their IP addresses, it's often more convenient to use container names as hostnames. On user-defined networks, Docker's embedded DNS server resolves these names to the appropriate IP addresses.
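A quick way to see this in action (the network and container names here are arbitrary):

# Create a user-defined bridge network
docker network create app_net
# Start a container on it
docker run -d --name redis --network app_net redis:7
# Another container on the same network can reach it by name
docker run --rm --network app_net alpine ping -c 1 redis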
Kubernetes (K8s) extends this concept further by providing a flat network for all pods, allowing every pod to reach every other pod by IP without NAT; on top of that, Services and cluster DNS give you stable names, so you don't need to track pod IPs.
But it also uses different networking models like CNI (Container Network Interface) to manage pod-to-pod and pod-to-service communication, and it typically uses large address spaces like 10.244.0.0/16.
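You can see both address spaces from kubectl:

kubectl get pods -o wide   # each pod gets an IP from the pod CIDR (e.g. 10.244.x.x)
kubectl get svc            # each Service gets a stable virtual IP (ClusterIP)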
You can create a simple docker-compose.yml file to define multiple services and their networking configurations:
services:
  redis:
    image: redis:7
    networks:
      redis_net:
        ipv4_address: 172.20.0.10

  api:
    image: mikelogaciuk/dummy-api:latest
    depends_on:
      - redis
      - db
    networks:
      redis_net:
        ipv4_address: 172.20.0.20
      api_net:
        ipv4_address: 172.21.0.10
      db_net:
        ipv4_address: 172.22.0.10

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    networks:
      db_net:
        ipv4_address: 172.22.0.20

networks:
  redis_net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
  api_net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.21.0.0/24
  db_net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.22.0.0/24
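Bring the stack up and inspect one of the networks to verify the assigned addresses; note that compose prefixes network names with the project (directory) name:

docker compose up -d
docker network ls --filter name=redis_net
docker network inspect <project>_redis_net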
Note: The code for my containerized dummy API is available on GitHub and it includes Static Application Security Testing (SAST) using Semgrep and Grype & Syft.
It has three endpoints: /, /health, and /error to simulate normal and error responses.
When working with cloud providers like AWS, Azure, or GCP, understanding their networking concepts is crucial. Each provider has its own terminology and services for managing networking, but they are all built on the same core networking principles.
So I will not go into details here (otherwise I would need to write a book).
That's why I will mention just a few core concepts:

- VPC (AWS, GCP) / VNet (Azure) – an isolated virtual network with its own address space (e.g. 10.0.0.0/16)
- Subnets – public and private segments of the VPC/VNet, usually spread across availability zones
- Security groups / network security groups – stateful, instance-level firewalls
- Route tables, internet gateways, and NAT gateways – control how traffic enters and leaves subnets
- Managed load balancers – L4/L7 load balancing (e.g. AWS ELB/ALB, Azure Load Balancer, GCP Cloud Load Balancing)
- Peering / VPN / private interconnects – connecting networks to each other and to on-premises
For more details, you can refer to the documentation of each cloud provider.