So, I walked into the office this morning to the following lovely alert from our MySQL Enterprise Monitor installation (MEM) which, thankfully, was working again after I had to rebuild it last week.
Time: 2009-07-08 01:33:28 GMT
Advisor: Replication – Slave Has Experienced A Replication Error
When a slave receives updates from its master it must apply those updates locally so the data on the slave matches that on the server. If an error occurs while applying an update on a slave, the data on the slave may not match that on the master and it is an indication that replication may be broken.
Investigate why the following error occurred on slave seawasp:3306. Look in the slaves error log for this and other errors around the time of this alert.
Last_Error: Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: ALTER TABLE ‘foo_student’ ADD ‘full_name’ VARCHAR(200) NULL DEFAULT NULL AFTER ‘surname_mother’
That’s. Just. Awesome.
So, let’s start from the beginning. What is error 1053?Â According to the manual this is ER_SERVER_SHUTDOWN.Â Not very helpful, is that?Â What this really means is that the something bad happened and the slave chose to stop talking to the master.Â In this particular case, a client appeared to issue an ALTER TABLE statement on a table in the master.Â This statement got replicated to the slave.Â Before the master could complete the query, it was aborted.Â The slave noted this and aborted also, thinking there was some form of corruption on the master.
So, what to do?
I went and poked at the table definitions on both the master and slave to figure out what had happened.Â It appears that the master had two extra columns in this table now, one of which was the aborted fullname column from above.Â Because it appeared to have completed on the master, I went ahead and re-ran it on the slave.Â No problems there.
Next, we have to get replication running again.Â MEM suggests that to do this, we have to skip the ALTER TABLE query that was already replicated to the slave.Â The MySQL docs on Troubleshooting Replication have the note:
If you decide that you can skip the next statement from the master, issue the following statements:mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N; mysql> START SLAVE;
The value of N should be 1 if the next statement from the master does not use AUTO_INCREMENT or LAST_INSERT_ID(). Otherwise, the value should be 2. The reason for using a value of 2 for statements that use AUTO_INCREMENT or LAST_INSERT_ID() is that they take two events in the binary log of the master.
So, in my case, after issuing the ALTER TABLE, I felt it safe enough to skip that in the relay log and restart the slave.Â This query wasn’t dealing with an auto incrementing value, so I chose 1 for the skip value.Â That note above about the 1 vs 2 is a good thing to heed if you’re ever needing to do this.
As soon as I started the slave, the relay log began draining of all of last night’s updates.Â It’s important to note that replication never really stopped in this case.Â The master was successfully sending updates to the slave, as far as I can tell.Â The slave just wasn’t applying those updates to it’s local copy because it felt there was some sort of corruption.
The last thing to answer is:Â what caused the original ALTER TABLE to fail?Â No idea at this point. I sent an email to the database owner to try and figure that out, but I haven’t heard anything back yet.