in DevOps

Query partially completed on the master (error on master: 1053)

So, I walked into the office this morning to the following lovely alert from our MySQL Enterprise Monitor installation (MEM) which, thankfully, was working again after I had to rebuild it last week.

Server: seawasp:3306
Time: 2009-07-08 01:33:28 GMT
Advisor: Replication – Slave Has Experienced A Replication Error

Problem Description
When a slave receives updates from its master it must apply those updates locally so the data on the slave matches that on the server. If an error occurs while applying an update on a slave, the data on the slave may not match that on the master and it is an indication that replication may be broken.

Advice
Investigate why the following error occurred on slave seawasp:3306. Look in the slaves error log for this and other errors around the time of this alert.

Last_Errno: 1053
Last_Error: Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: ALTER TABLE ‘foo_student’ ADD ‘full_name’ VARCHAR(200) NULL DEFAULT NULL AFTER ‘surname_mother’

That’s. Just. Awesome.

So, let’s start from the beginning. What is error 1053?  According to the manual this is ER_SERVER_SHUTDOWN.  Not very helpful, is that?  What this really means is that the something bad happened and the slave chose to stop talking to the master.  In this particular case, a client appeared to issue an ALTER TABLE statement on a table in the master.  This statement got replicated to the slave.  Before the master could complete the query, it was aborted.  The slave noted this and aborted also, thinking there was some form of corruption on the master.

So, what to do?

I went and poked at the table definitions on both the master and slave to figure out what had happened.  It appears that the master had two extra columns in this table now, one of which was the aborted fullname column from above.  Because it appeared to have completed on the master, I went ahead and re-ran it on the slave.  No problems there.

Next, we have to get replication running again.  MEM suggests that to do this, we have to skip the ALTER TABLE query that was already replicated to the slave.  The MySQL docs on Troubleshooting Replication have the note:

If you decide that you can skip the next statement from the master, issue the following statements:

mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N;
mysql> START SLAVE;

The value of N should be 1 if the next statement from the master does not use AUTO_INCREMENT or LAST_INSERT_ID(). Otherwise, the value should be 2. The reason for using a value of 2 for statements that use AUTO_INCREMENT or LAST_INSERT_ID() is that they take two events in the binary log of the master.

So, in my case, after issuing the ALTER TABLE, I felt it safe enough to skip that in the relay log and restart the slave.  This query wasn’t dealing with an auto incrementing value, so I chose 1 for the skip value.  That note above about the 1 vs 2 is a good thing to heed if you’re ever needing to do this.

As soon as I started the slave, the relay log began draining of all of last night’s updates.  It’s important to note that replication never really stopped in this case.  The master was successfully sending updates to the slave, as far as I can tell.  The slave just wasn’t applying those updates to it’s local copy because it felt there was some sort of corruption.

The last thing to answer is:  what caused the original ALTER TABLE to fail?  No idea at this point. I sent an email to the database owner to try and figure that out, but I haven’t heard anything back yet.

Travis Campbell
Staff Systems Engineer at ghostar
Travis Campbell is a seasoned Linux Systems Engineer with nearly two decades of experience, ranging from dozens to tens of thousands of systems in the semiconductor industry, higher education, and high volume sites on the web. His current focus is on High Performance Computing, Big Data environments, and large scale web architectures.
  1. Hi,
    The problem you faced is that the “ALTER TABLE ‘foo_student’ …” query is not completed in the master. To ensure this, if you convert the bin log of the master, you could find the error number 1053 in errno variable of this particular query.
    When the slave reads this query from master and sees the error number withe query and feels that something was happened to the master when executing this query. So the slave would stop at this position. You only have to decide then whether to execute this query in slave or not and resume the slave.
    you got it?

    ars

Comments are closed.