RunBook
Important websites
- Gearmand - https://gearmandmonitor.smsmanagerpro.com
- Memcached - https://memcachedmonitor.smsmanagerpro.com
- Proxmox - https://192.168.25.49:8006
- NinjaOne - https://app.ninjarmm.com
- Sim Farm 1 - https://192.168.25.96
- Kannel - http://192.168.25.93:13000/status?password=shuser
Daily Routine
Most of these checks already have alerts in place, so if anything goes off a text message is sent to the admin's phone, but it is still good practice to run a daily check on the system.
- Make sure all database connections are still intact by logging in to each MongoDB server and running the command “ps aux | grep mongo”. If it returns a mongos process, we’re still good. If not, we will have to relaunch it (see “How to relaunch a mongos process” under Tips & Tricks).
- Log in to the Proxmox cluster at https://192.168.25.49:8006 to make sure all VMs are still online.
- Check that all SIM channels are operational.
- Calculate and update charges for Teltik
- Log in to NinjaOne to make sure no alerts are going off
- Log in to iDRAC to make sure no alerts are going off
- Check that Kannel is up and running
- Delete any pool locks from memcached; their names start with pool_.
- Check yesterday's opt-out ratio, broken down by pool
- Compare the vendor’s stats with ours to make sure they still match
- Do a round of pool testing to see whether each pool is still delivering and whether it has any delay
- Log in to the NDB server and check that all SQL connectors are still available. Log in to the mngr1 server, then run these commands
sudo -s
ndb_mgm
show
There should be 4 ndbd nodes, 2 mgmd nodes and 7 mysqld nodes.
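The node count check above can be scripted. The following is a minimal sketch, assuming the text printed by show uses the usual ndb_mgm layout: a section header such as [ndbd(NDB)], followed by one id= line per node, with "not connected" on nodes that are down. The function names and the EXPECTED table are hypothetical helpers, not part of the system.

```python
import re

# Expected connected nodes per section, taken from the runbook
# (4 ndbd, 2 mgmd, 7 mysqld). Keys follow the section names that
# ndb_mgm is assumed to print in its "show" output.
EXPECTED = {"ndbd": 4, "ndb_mgmd": 2, "mysqld": 7}


def count_connected_nodes(show_output):
    """Count connected nodes per section in the text printed by 'show'."""
    counts = {}
    section = None
    for raw_line in show_output.splitlines():
        line = raw_line.strip()
        header = re.match(r"\[(\w+)\(\w+\)\]", line)
        if header:
            # New section, e.g. "[ndbd(NDB)]  4 node(s)"
            section = header.group(1)
            counts[section] = 0
        elif section and line.startswith("id=") and "not connected" not in line:
            counts[section] += 1
    return counts


def missing_nodes(show_output, expected=EXPECTED):
    """Return {section: how many nodes are missing} for sections below target."""
    counts = count_connected_nodes(show_output)
    return {
        name: wanted - counts.get(name, 0)
        for name, wanted in expected.items()
        if counts.get(name, 0) < wanted
    }
```

An empty dictionary from missing_nodes means every section has at least the expected number of connected nodes. Paste the output of show into a string (or capture it from the shell) and pass it in.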
Known issues and how to fix them
Error codes are unclear
- We have documented most of the codes that we see on a daily basis, but any new error code will need to be explained by the vendor. The same code number can mean different things from one vendor to another, so each code has to be interpreted per vendor.
Need to delete drops
- The only way to delete a drop is to enter a date and time in the “deleted_at” column of the dropmanagement table in MySQL. I will make a delete button on the user interface later.
Ashlie needs to modify a drop
- The only way to modify a drop is to edit its row in the dropmanagement table in MySQL. I will make an edit button to do it from the interface later.
FAILED_NODIDS
- This error means either the pool's daily cap for the carrier has already been reached, or Ashlie is sending more than the pool allows.
Stats not updated or delayed
- Check whether there is a lock in memcached. If there is, delete it, or wait 30 minutes for the lock to expire by itself.
A drop says it’s DEPLOYING but I’m not seeing any movement.
- Most likely there are still messages in the queue for that pool that are going out. Check https://gearmandmonitor.smsmanagerpro.com to see how the queue is looking.
Weekly House Cleaning and Backup
Every Monday at 4AM PST, the system does a full system-wide power cycle, which takes about 1 minute. All components of the system automatically reconnect after the power cycle.
Every night at 1AM, the system does a full database backup on the Frontend server. The person in charge needs to manually download the backup and store it accordingly every day. Here is the backup strategy:
- We will have the same backup copies on Google Drive, Frontend server and local workstation.
- We will keep these backup versions:
  - Every day of the last 7 days
  - Sunday backups of the last 4 weeks
  - Last-day-of-the-month backups of the last 12 months
  - Last-day-of-the-year backups of the last 5 years
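The retention rules above can be expressed as a small check that decides whether a given backup should be kept. This is only a sketch: the function names are hypothetical, and the exact window boundaries (7 days counted from today, 4 weeks as 28 days, months and years counted by calendar difference) are assumptions that should be adjusted to match how the backups are actually pruned.

```python
from datetime import date, timedelta


def is_last_day_of_month(day):
    """True when the following day falls in a different month."""
    return (day + timedelta(days=1)).month != day.month


def should_keep(backup_day, today):
    """Apply the four retention rules to one backup date."""
    age_days = (today - backup_day).days
    if age_days < 0:
        return True  # never flag a future-dated file for deletion
    if age_days < 7:
        return True  # every day of the last 7 days
    if backup_day.weekday() == 6 and age_days < 28:
        return True  # Sunday backups of the last 4 weeks
    months_ago = (today.year - backup_day.year) * 12 + (today.month - backup_day.month)
    if is_last_day_of_month(backup_day) and months_ago <= 12:
        return True  # month-end backups of the last 12 months
    is_year_end = backup_day.month == 12 and backup_day.day == 31
    if is_year_end and today.year - backup_day.year <= 5:
        return True  # year-end backups of the last 5 years
    return False
```

For example, should_keep(date(2023, 6, 30), date(2024, 1, 26)) keeps the backup because it is a month-end copy from within the last 12 months. Run the same check against each of the three storage locations so they stay in sync.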
Export Weekly log file for JB
Log in to MongoDB Compass and switch to the database named “database”. On the history collection, run this filter
{$and: [{ date_sent: { $gte: ISODate('2024-01-19T06:00:00.000Z') }},{ date_sent: { $lte: ISODate('2024-01-26T05:59:59.000Z') }}]}
Each week, move both dates forward by 7 days: replace 2024-01-19 with the start date of the week being exported and 2024-01-26 with the date 7 days after it. Run the query, then download the result.
Do the same with the clickhistory collection, but use this filter
{$and: [{ click_time: { $gte: ISODate('2024-01-19T06:00:00.000Z') }},{ click_time: { $lte: ISODate('2024-01-26T05:59:59.000Z') }}]}
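To avoid editing the dates by hand each week, the filter text can be generated. The sketch below assumes the window always starts at 06:00:00 UTC on the start date and ends one second before 06:00:00 UTC seven days later, as in the two filters above; weekly_filter is a hypothetical helper name.

```python
from datetime import date, datetime, timedelta, timezone


def weekly_filter(field, week_start):
    """Build the Compass filter text for a 7-day window on the given field."""
    start = datetime(week_start.year, week_start.month, week_start.day,
                     6, 0, 0, tzinfo=timezone.utc)
    # The window ends one second before the next week's window begins.
    end = start + timedelta(days=7) - timedelta(seconds=1)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return "{$and: [{ %s: { $gte: ISODate('%s') }},{ %s: { $lte: ISODate('%s') }}]}" % (
        field, start.strftime(fmt), field, end.strftime(fmt))


if __name__ == "__main__":
    # Print the filters for the week after the one shown in the runbook.
    print(weekly_filter("date_sent", date(2024, 1, 26)))
    print(weekly_filter("click_time", date(2024, 1, 26)))
```

Paste the first printed line into the history collection's filter box and the second into clickhistory.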
Tips & Tricks
I need to find all of the messages that were sent to a phone number
- In SMP’s user interface, navigate to SMS History and enter the phone number in the Search field. It will pull up all of the messages that were sent to that number.
I need to unsubscribe certain phone numbers
- Send a POST request to https://api.smsmanagerpro.com/unsubphone with a field called “phones” containing the phone numbers, separated by commas.
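A minimal sketch of that request using only the Python standard library. The runbook does not say how the body is encoded or whether authentication is required, so the form-encoded body here is an assumption, and build_unsub_request and unsubscribe are hypothetical helper names. Adjust the headers if the API expects JSON or a token.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

UNSUB_URL = "https://api.smsmanagerpro.com/unsubphone"


def build_unsub_request(phones):
    """Build the POST request with all numbers joined by commas in 'phones'."""
    # Assumption: the endpoint accepts a standard form-encoded body.
    body = urlencode({"phones": ",".join(phones)}).encode("ascii")
    return Request(
        UNSUB_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def unsubscribe(phones, timeout=30):
    """Send the request and return the HTTP status code."""
    with urlopen(build_unsub_request(phones), timeout=timeout) as response:
        return response.status
```

Calling unsubscribe with a list of numbers sends a single request for the whole batch.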
I want to reconcile to make sure the vendor’s bill matches with ours
- Log in to SMP, go to Dashboard, and right under Expenses Accounting select the Start Date and End Date, then click View. It should show the total messages and the cost.
- A good rule of thumb is that our number will be a bit higher than the vendor’s bill, though not by a lot. This is because our system runs on UTC, and because we count the help and opt-out confirmation messages, which some vendors count and some don’t.
Kannel does not want to start
- Log in to http://192.168.25.93:9004/ and stop everything. Then SSH into the Kannel server and run these commands
sudo -s
cd /usr/local/kannel/sbin
rm -f kannel.store kannel.store.bak smsbox.stdout.log* access.log smpp_dave.log smsbox.log
- After that, go back to the interface and start up the processes in this order
Bearerbox:bearbox_0
KNL_SQLBOX
KNL_SMSBOX
Our VPN is offline
- This is a major issue, and the only way to resolve it is to open a ticket with Tier.net to restart the firewall box. When opening the ticket, make sure the Priority is High and use the following template:
Could you do me a favor and do a power cycle reboot for the sonicwall TZ400.
Location: ET4, Middle shelf on top of the TZ270
Device: Sonicwall TZ400
Instruction: just unplug the power cable, wait for 15s and plug it back in.
- Once the VPN is back up, everything should automatically reconnect
How to check the sim route?
- When you log in to the SIM route interface, navigate to Gateway Setting -> AT Command. Any port with a red dot next to it needs a restart. On the same page, select the port number in the drop-down under Module Operations and click restart. This brings the port back up (sort of like restarting your phone to pick up the signal again).
How to relaunch a mongos process
- SSH into the server and run sudo -s to switch to root (we need to be root to increase the max open file limit). Then run the command
ulimit -n 100000 && sudo rm -f /home/mongodbstorage/mongos.log && sudo touch /home/mongodbstorage/mongos.log && sudo chown -R mongod:mongod /home/mongodbstorage/mongos.log && /usr/bin/mongos --config /etc/mongos.conf
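Before relaunching, it helps to confirm that mongos is really not running, since “ps aux | grep mongo” also matches mongod and the grep command itself. The sketch below parses ps aux output and reports only true mongos processes. It assumes the standard 11-column ps aux layout; mongos_pids and running_mongos_pids are hypothetical helper names.

```python
import subprocess


def mongos_pids(ps_output):
    """Return the PIDs of mongos processes found in 'ps aux' output."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # 11th field is the full command line
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        # Match on the executable name only, so mongod and grep are ignored.
        if executable.rsplit("/", 1)[-1] == "mongos":
            pids.append(int(fields[1]))
    return pids


def running_mongos_pids():
    """Run 'ps aux' on this server and return the mongos PIDs."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return mongos_pids(result.stdout)
```

An empty list from running_mongos_pids means mongos is down and the relaunch command above is needed.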
What happens if gearmand or memcached goes down?
- This rarely happens. The only ways it could happen are:
  - gearmand maxed out its memory capacity; the queue would need to be in the millions for this to happen
  - memcached maxed out its capacity
  - The host server went down or rebooted for whatever reason and the VMs migrated automatically to another host server, the way Proxmox is designed to do
- Steps to follow when this happens:
  - Wait for the VM to come back up, or log in to Proxmox to bring the server back up
  - memcached relaunches automatically when its VM comes online, but for gearmand, SSH to the server and run this command as root
ulimit -n 100000 && gearmand --listen 192.168.25.61 --port 8888 --log-file /var/log/gearmand.log -d
  - After that, SSH to workers 1 - 4 and run these 2 commands to reconnect to the gearmand and memcached servers
pkill supervisord
rm -f /tmp/drop* && ulimit -n 100000 && supervisord -c /var/www/html/etc/supervisord.conf