SaltStack Minion communication and missing returns
Posted on 05-30-2016 04:42:57 UTC | Updated on 08-02-2016 00:27:12 UTC

Setting up SaltStack is a fairly easy task, and there is plenty of documentation for it. This is not an install tutorial. It explains, and helps troubleshoot, what is going on with SaltStack Master and Minion communication, mostly when using the CLI to send commands from the Master to the Minions.

Basic Check List

After you have installed the Salt Master and Salt Minion software and started your Master, the first thing to do is open each Minion's config file, /etc/salt/minion, and fill out the line "master: " to tell the Minion where his Master is. Then start or restart the Salt Minion. Do this for all your Minions.

Go back to the Master and accept all of the Minions' keys (salt-key -L lists the keys and salt-key -A accepts all pending ones). If you don't see a certain Minion's key, here are some things you should check.

  1. Are your Minion and Master running the same software version? The Master can usually work at a higher version than the Minions. Try to keep them the same if possible.
  2. Is your salt-minion service running? Make sure it is set to run on start as well.
  3. Has the Minion's key been accepted by the Master? If you don't even see a key request from the Minion, then the Minion is not even talking to the Master.
  4. Does the Minion have an unobstructed network path back to TCP ports 4505 and 4506 on the Master? The Minions initiate the TCP connections back to the Master, so they don't need any inbound ports open themselves. Watch out for those firewalls.
  5. Check your Minion's log file in /var/log/salt/minion for key issues or any other issues.
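
Item 4 in the list above can be checked from the Minion's side with a few lines of Python. This is only a minimal sketch: the function names are made up for this example, and the host name in the comment is a placeholder for whatever you put after "master: " in /etc/salt/minion. Port 4505 is the Master's publish port and 4506 is its default return port.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_master_ports(master_host, ports=(4505, 4506), timeout=3.0):
    """Map each Master port to whether this machine can open a TCP connection to it."""
    return {port: can_reach(master_host, port, timeout) for port in ports}


# Example call, run on the Minion (placeholder host name):
#   check_master_ports("salt-master.example.com")
```

A False for either port usually points at a firewall or routing problem between the Minion and the Master, not at Salt itself.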

Basic Communication

Now let's say you have all of the basic network and key issues worked out and would like to send some jobs to your Minions. You can do this via the Salt CLI, with something like salt \* cmd.run 'echo HI'. This is considered a job by Salt. The Minions get this request, run the command, and return the job information to the Master. The CLI talks to the Master, which is listening for the return messages as they come in on the ZMQ bus. The CLI then reports back the status and output of the job.

That is a basic view of the process. But sometimes Minions don't return job information, and you ask yourself what the heck happened. You know the Minion is running fine. Eventually you find out you don't really understand Minion-Master job communication at all.

Detailed Breakdown of Master Minion CLI Communication

By default, when the job information is returned to the Master, it is stored on disk in the job cache. We will assume this is the case below.

The Salt CLI is just a small bit of code that interfaces with the API SaltStack has written, which allows anyone to send commands to the Minions programmatically. The CLI is not connected directly to the Minions when the job request is made. When the CLI makes a job request, it is handed to the Master to fulfill.

There are 2 key timeout periods you need to be aware of before we go into an explanation of how a job request is handled. They are "timeout" and "gather_job_timeout".

When the CLI command is issued, the Master gathers a list of Minions with valid keys so it knows which Minions are on the system. It validates and filters the targeting information from the given target list and sets that as its list (targets) of Minions for the job. Now the Master has a list of who should return information when queried. The Master takes the requested command, target list, job id, and a few other pieces of info, and broadcasts a message on the ZeroMQ bus to all of the Minions. When the Minions get the message, they look at the target list and decide if they should execute the job or not. If a Minion sees he is in the target list, he executes the job. If a Minion sees he is not part of the target list, he just ignores the message. Each Minion that decides to run the command creates a local job id for the job and then performs the work.
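
The publish step described above can be modeled in a few lines of Python. This is a toy model and not Salt's code: the Job and Minion classes, the function names, and the job id are all made up for illustration. It only shows the shape of the idea: the Master narrows the request to Minions with accepted keys, every Minion receives the same broadcast, and each one decides for itself whether to run the job.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    jid: str            # job id assigned by the Master (hypothetical value below)
    function: str       # e.g. "cmd.run"
    args: tuple         # e.g. ("echo HI",)
    targets: frozenset  # Minion ids the Master expects returns from


@dataclass
class Minion:
    minion_id: str
    ran: list = field(default_factory=list)

    def on_publish(self, job):
        """Called for every broadcast; run the job only if this Minion is targeted."""
        if self.minion_id not in job.targets:
            return False  # not in the target list: ignore the message
        self.ran.append(job.jid)  # stand-in for actually doing the work
        return True


def resolve_targets(accepted_keys, requested):
    """Master side: keep only the requested ids that have an accepted key."""
    return frozenset(requested) & frozenset(accepted_keys)


def broadcast(job, minions):
    """Send the same message to every Minion; return the ids that ran it."""
    return sorted(m.minion_id for m in minions if m.on_publish(job))


accepted = ["web1", "web2", "db1"]
job = Job("20160530044257000000", "cmd.run", ("echo HI",),
          resolve_targets(accepted, ["web1", "web2", "ghost"]))
minions = [Minion(name) for name in accepted]
print(broadcast(job, minions))  # prints ['web1', 'web2']
```

Note that "ghost" never makes it into the target list because it has no accepted key, and "db1" sees the broadcast but ignores it.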

While the Minions are working on their jobs, the CLI waits for the initial timeout period (-t or timeout:) to expire. When it does, the CLI sends the first "find_job" query, which kicks off the gather_job_timeout timer. The Minions receive the find_job request with the original job id. A Minion that is still running the job responds with a status of "still working"; one that is done responds with "Job Finished". If a Minion does not respond to the request within the gather_job_timeout period (10 seconds by default), the CLI marks the Minion as "non responsive" for that polling interval. All Minions keep being queried on the gather_job_timeout interval. If the Minions do not reply within this timeout, or all report that they are no longer running the job in question, the CLI command returns. If one or more Minions reply that they are still running the job, the initial timeout is triggered again and the cycle repeats.
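
To get a feel for how the two timeouts interact, here is a small simulation of the polling cycle described above. It is a simplified model built only from that description and is not Salt's implementation: the function name and its inputs are invented, time is a plain counter and not a real clock, and it assumes the CLI returns as soon as every targeted Minion has returned.

```python
def simulate_cli_wait(finish_times, timeout=5, gather_job_timeout=10):
    """Model how long the CLI waits before it returns.

    finish_times maps a Minion id to the number of seconds its job takes,
    or to None for a Minion that never answers at all.
    Returns (seconds_waited, find_job_polls, ids_that_did_not_return).
    """
    responsive = {m: t for m, t in finish_times.items() if t is not None}
    silent = sorted(m for m, t in finish_times.items() if t is None)
    last_finish = max(responsive.values(), default=0)
    clock = 0
    polls = 0
    while True:
        if not silent and last_finish <= clock + timeout:
            # Assumption: once every return is in, there is nothing left to wait for.
            return max(clock, last_finish), polls, silent
        clock += timeout   # wait out the timeout (-t) period
        polls += 1         # then send a find_job query
        still_running = any(t > clock for t in responsive.values())
        if silent:
            clock += gather_job_timeout  # silent Minions use up the whole window
        if not still_running:
            return clock, polls, silent
        # Someone answered "still working": the timeout cycle starts again.


print(simulate_cli_wait({"web1": 1, "web2": 3}))               # (3, 0, [])
print(simulate_cli_wait({"web1": 12}))                         # (12, 2, [])
print(simulate_cli_wait({"web1": 1, "db1": None}))             # (15, 1, ['db1'])
print(simulate_cli_wait({"web1": 1, "db1": None}, timeout=3600))  # (3610, 1, ['db1'])
```

The last two calls show why a large -t hurts: a single silent Minion keeps the CLI waiting for the whole timeout plus the gather_job_timeout window, even though the only responsive Minion finished after one second.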

The CLI will show the output from the Minions as they finish their jobs. For the Minions that did not respond, but are connected to the Master, you will see the message "Minion did not return". If a Minion does not even look like it has a TCP connection with the Master, you will see "Minion did not return. [Not connected]".

By this time the Master should have marked the job as finished, and the job's info should be available in the job cache. The above is a high-level explanation of how Master and Minions communicate. There are more details to this process, but this should give you a basic idea of how it works.

Takeaways From This Info

  1. There is no defined period on how long a job will take. The job will finish when the last responsive Minion has said it is done.
  2. If a Minion is not up or connected when a job request is sent out, then the Minion just misses that job. It is _not_ queued by the Master and sent at a later time.
  3. Currently there is no hard timeout to force the Master to stop listening after a certain amount of time.
  4. If you set your timeout (-t) to be something silly like 3600, then if even one Minion is not responding the CLI will wait the full 3600 seconds to return. Beware!

Missing Returns

Sometimes you know there are Minions up and working, but you get "Minion did not return", or you do not see any info from the Minion at all before the CLI times out. It is frustrating, as you can send the Minion that just failed another job and it finishes it with no problem. There can be many reasons for this. Try/check the following things.
