RunBook

From PM Wiki

Important websites

Daily Routine

Most of these checks already have alerting in place, so if anything goes wrong a text message is sent to the admin's phone. It is still good practice to run through this daily check of the system.

  • Make sure all database connections are still intact: log in to each MongoDB server and run “ps aux | grep mongo”. If the output includes a mongos process, we’re still good. If not, we will have to relaunch it (see “How to relaunch a mongos process” below).
  • Log in to the Proxmox cluster at https://192.168.25.49:8006 to make sure all VMs are still online.
  • Check that all SIM channels are operational.
  • Calculate and update charges for Teltik.
  • Log in to NinjaOne to make sure no alerts are going off.
  • Log in to iDRAC to make sure no alerts are going off.
  • Check that Kannel is up and running.
  • Delete any pool locks from memcached; their key names start with pool_.
  • Check yesterday's opt-out ratio, broken down by pool.
  • Compare the vendor’s stats with ours to make sure they still match.
  • Do a round of pool testing to see whether each pool is still delivering and whether there is any delay.
  • Log in to the NDB server and check that all SQL connectors are still available. Log in to the mngr1 server, then run these commands:
sudo -s
ndb_mgm
show
There should be 4 ndbd nodes, 2 mgmd nodes and 7 mysqld nodes.
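
The mongos check in the list above can be scripted. One thing to watch for is that “ps aux | grep mongo” normally lists the grep process itself, so a match on the word "mongo" alone is not enough. Below is a minimal sketch, assuming the standard 11-column `ps aux` layout; the function name `mongos_running` is hypothetical and not part of our tooling.

```python
import os


def mongos_running(ps_output: str) -> bool:
    """Return True if any line of `ps aux` output is a real mongos process."""
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the last one is the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        # Compare the executable name only, so "grep mongo" and mongod
        # are not mistaken for mongos.
        if os.path.basename(executable) == "mongos":
            return True
    return False
```

To use it on a server, pass in the text returned by `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.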

Known issues and how to fix them

Error codes are unclear

We have documented most of the codes that we see on a daily basis, but any new error code will need to be explained by the vendor. Codes differ from one vendor to another, so each code has to be interpreted per vendor, even when the code number is the same.

Need to delete drops

The only way to delete a drop is to enter a date and time in the “deleted_at” column of the dropmanagement table in MySQL. I will add a delete button to the user interface later.
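
As an illustration of the “deleted_at” approach, here is a minimal sketch of the update. It runs against an in-memory SQLite table as a stand-in for MySQL so it can be tried safely. The `id` column, the UTC timestamp and the function name `soft_delete_drop` are assumptions; check the real dropmanagement schema before running anything similar in production.

```python
import sqlite3
from datetime import datetime, timezone


def soft_delete_drop(conn, drop_id, when=None):
    """Mark one drop as deleted by filling in deleted_at; return rows changed."""
    when = when or datetime.now(timezone.utc)  # assumption: timestamps are UTC
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute(
        # Only touch rows that are not already deleted, so an earlier
        # deleted_at value is never overwritten.
        "UPDATE dropmanagement SET deleted_at = ? "
        "WHERE id = ? AND deleted_at IS NULL",
        (stamp, drop_id),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table with a hypothetical minimal schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE dropmanagement (id INTEGER PRIMARY KEY, deleted_at TEXT)")
demo.execute("INSERT INTO dropmanagement (id) VALUES (1), (2)")
soft_delete_drop(demo, 1)
```

The row stays in the table; only the timestamp changes, so clearing the column again would undo the delete.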

Ashlie needs to modify a drop

The only way to modify a drop is to edit its row in the dropmanagement table in MySQL. I will add an edit button to the interface later.

FAILED_NODIDS

This error means either the pool's daily cap for that carrier has already been reached, or Ashlie is sending more than the pool is allowed to send.

Stats not updated or delayed

Check whether there is a lock in memcached. If there is, delete it, or wait 30 minutes for the lock to expire by itself.
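
For reference, deleting a key over memcached's text protocol is a single `delete <key>` line. The sketch below only builds those lines for keys that start with pool_; it does not connect to anything. The key names in the example and the function name `pool_lock_delete_commands` are hypothetical, and how you obtain the list of current keys is left open.

```python
def pool_lock_delete_commands(keys):
    """Build memcached text-protocol delete commands for pool lock keys only."""
    return [
        f"delete {key}\r\n".encode("ascii")
        for key in keys
        if key.startswith("pool_")  # pool locks are named pool_...
    ]


# Hypothetical key names, for illustration only.
commands = pool_lock_delete_commands(["pool_12", "session_a", "pool_abc"])
```

Each resulting line could be sent over a telnet or socket session to the memcached server; keys without the pool_ prefix are left alone.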

A drop says it’s DEPLOYING but I’m not seeing any movement

This is most likely because there are still messages for that pool in the queue waiting to go out. Check https://gearmandmonitor.smsmanagerpro.com to see how the queue is looking.


Weekly House Cleaning and Backup

Every Monday at 4 AM PST, the system does a full system-wide power cycle, which takes about 1 minute. All components of the system automatically reconnect after the power cycle.

Every night at 1 AM, the system does a full database backup on the Frontend server. The person in charge needs to manually download the backup and store it accordingly every day. Here is the backup strategy:

  • We keep identical backup copies on Google Drive, the Frontend server and the local workstation.
  • We keep these backup versions:
      • Every day of the last 7 days
      • Sunday backup of the last 4 weeks
      • Last-day-of-the-month backup of the last 12 months
      • Last-day-of-the-year backup of the last 5 years
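
The retention rules above can be expressed as a small function that decides which backup dates to keep. This is a sketch under stated assumptions, not an existing script: "last 7 days" is read as today plus the 6 days before it, "last 4 weeks" as the last 28 days, "last 12 months" as up to 12 calendar months back, and "last 5 years" as up to 5 calendar years back. The name `backups_to_keep` is hypothetical.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today):
    """Return the sorted backup dates that the retention rules would keep."""
    keep = set()
    for d in backup_dates:
        age_days = (today - d).days
        if age_days < 0:
            continue  # ignore dates in the future
        months_ago = (today.year - d.year) * 12 + (today.month - d.month)
        years_ago = today.year - d.year
        last_of_month = (d + timedelta(days=1)).month != d.month
        last_of_year = d.month == 12 and d.day == 31

        if age_days < 7:                            # daily backups, last 7 days
            keep.add(d)
        elif d.weekday() == 6 and age_days < 28:    # Sunday backups, last 4 weeks
            keep.add(d)
        elif last_of_month and months_ago <= 12:    # month-end backups
            keep.add(d)
        elif last_of_year and years_ago <= 5:       # year-end backups
            keep.add(d)
    return sorted(keep)
```

Anything in the input that is not in the returned list would be a candidate for deletion from all three storage locations.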


Export Weekly log file for JB

Log in to MongoDB Compass and open the database “database”. On the history collection, run this query filter:

{$and: [{ date_sent: { $gte: ISODate('2024-01-19T06:00:00.000Z') }},{ date_sent: { $lte: ISODate('2024-01-26T05:59:59.000Z') }}]}

Each week, move both dates forward by 7 days: replace 2024-01-19 with the date 7 days after it and do the same with 2024-01-26. Run the query, then download the query result.

Do the same with the clickhistory collection, but use this query filter:

{$and: [{ click_time: { $gte: ISODate('2024-01-19T06:00:00.000Z') }},{ click_time: { $lte: ISODate('2024-01-26T05:59:59.000Z') }}]}
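
To avoid editing the dates by hand each week, the two filters can be generated from the window's start time. The sketch below reproduces the exact filter text used above, with the end set to 7 days after the start minus one second. The function name `weekly_filter` is hypothetical, and the start time is assumed to be given in UTC.

```python
from datetime import datetime, timedelta


def weekly_filter(field, start):
    """Build the Compass filter for a 7-day window beginning at `start` (UTC)."""
    end = start + timedelta(days=7) - timedelta(seconds=1)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return (
        "{$and: [{ %s: { $gte: ISODate('%s') }},{ %s: { $lte: ISODate('%s') }}]}"
        % (field, start.strftime(fmt), field, end.strftime(fmt))
    )


start = datetime(2024, 1, 19, 6, 0, 0)
print(weekly_filter("date_sent", start))    # filter for the history collection
print(weekly_filter("click_time", start))   # filter for the clickhistory collection
```

For the following week, pass `start + timedelta(days=7)` and paste the printed filters into Compass.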

Tips & Tricks

I need to find all of the messages that were sent to a phone number

In SMP’s user interface, navigate to SMS History and enter the phone number in the Search field. It will pull up all of the messages that were sent to that number.

I need to unsubscribe certain phone numbers

Send a POST request to https://api.smsmanagerpro.com/unsubphone with a field called “phones” containing the phone numbers, separated by commas.
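
A minimal sketch of building that request with Python's standard library follows. It assumes the endpoint accepts a form-encoded body and needs no authentication, neither of which is confirmed here; the phone numbers are placeholders and `build_unsub_request` is a hypothetical helper name. The code only constructs the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

UNSUB_URL = "https://api.smsmanagerpro.com/unsubphone"


def build_unsub_request(phones):
    """Build (but do not send) the POST request that unsubscribes `phones`."""
    # All numbers go into one "phones" field, separated by commas.
    body = urlencode({"phones": ",".join(phones)}).encode("ascii")
    return Request(
        UNSUB_URL,
        data=body,
        method="POST",
        # Assumption: the API expects a standard form-encoded body.
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_unsub_request(["15555550100", "15555550101"])  # placeholder numbers
```

Passing the result to `urllib.request.urlopen(request)` would actually send it, so double-check the number list first.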

I want to reconcile to make sure the vendor’s bill matches with ours

Log in to SMP and go to Dashboard. Under Expenses Accounting, select the Start Date and End Date, then click View. It should show the total messages and the cost.
A good rule of thumb is that our number will be a bit higher than the vendor’s bill. Not by a lot, but typically higher, because our system runs on UTC and we count the help and opt-out confirmation messages, which some vendors count and some don’t.

Kannel does not want to start

Log in to http://192.168.25.93:9004/ and stop everything. Then SSH into the Kannel server and run these commands:
sudo -s
cd /usr/local/kannel/sbin
rm -f kannel.store kannel.store.bak smsbox.stdout.log* access.log smpp_dave.log smsbox.log
After that, go back to the interface and start the processes in this order:
Bearerbox:bearbox_0
KNL_SQLBOX
KNL_SMSBOX

Our VPN is offline

This is a major issue, and the only way to resolve it is to open a ticket with Tier.net to restart the firewall box. When opening the ticket, set the Priority to High and use the following template:
Could you do me a favor and do a power cycle reboot for the sonicwall TZ400.
Location: ET4, Middle shelf on top of the TZ270
Device: Sonicwall TZ400
Instruction: just unplug the power cable, wait for 15s and plug it back in.
Once the VPN is back up, everything should automatically reconnect.

How to check the sim route?

After you log in to the SIM route interface, navigate to Gateway Setting -> AT Command. Any port with a red dot next to it needs to be restarted. On the same page, select the port number in the drop-down under Module Operations and click restart. This restarts the port and brings it back up (much like restarting your phone to pick up the signal again).


How to relaunch a mongos process

SSH into the server and run sudo -s to switch to root (we need to be root to increase the max open file limit). Then run this command:
ulimit -n 100000 && sudo rm -f /home/mongodbstorage/mongos.log && sudo touch /home/mongodbstorage/mongos.log && sudo chown -R mongod:mongod /home/mongodbstorage/mongos.log && /usr/bin/mongos --config /etc/mongos.conf

What happens if gearmand or memcached goes down?

This rarely happens. The only ways it could happen are:
  • gearmand maxes out its memory capacity; the queue would need to hold millions of jobs for this to happen.
  • memcached maxes out its capacity.
  • The host server goes down or reboots for whatever reason, and the VMs migrate automatically to another host server, the way Proxmox is designed to do.
Steps to follow when this happens:
  • Wait for the VM to come back up, or log in to Proxmox to bring the server back up.
  • memcached relaunches automatically when its server comes online, but for gearmand, SSH to the server and run this command as root:
ulimit -n 100000 && gearmand --listen 192.168.25.61 --port 8888 --log-file /var/log/gearmand.log -d
  • After that, SSH to workers 1 - 4 and run these 2 commands on each to reconnect to the gearmand and memcached servers:
pkill supervisord
rm -f /tmp/drop* && ulimit -n 100000 && supervisord -c /var/www/html/etc/supervisord.conf