MySQL Replication Repairing

MySQL replication stopped due to a disk full situation on MySQL server. Here are notes on how I repaired it.

Due to /opt being full on DB master, the replication crashed. Symptoms, on DB slave:

PCUDSSW1513> sudo tail -100 /var/log/mysqld.log
120524  9:43:48 [ERROR] Got fatal error 1236: 'binlog truncated in the middle of event' from master when reading data from binary log
120524  9:43:48 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000063', position 399760643

Checked the given binary log file on DB master:

PCUDSSX1501 2 /opt/mysql-data> sudo -u mysql mysqlbinlog --start-position 399760643 'mysql-bin.000063'
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 14597, event_type: 2
Could not read entry at offset 399760643:Error in log format or read error
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Similar troubles for some positions before that. So looked at the whole file:

PCUDSSX1501 2 /opt/mysql-data> sudo -u mysql mysqlbinlog 'mysql-bin.000063' > /opt/simko/zzz.sql
ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 14597, event_type: 2
Could not read entry at offset 399760643:Error in log format or read error

and:

PCUDSSX1501 2 /opt/mysql-data> tail -10 /opt/simko/zzz.sql
/*!*/;
# at 399760393
#120523 22:22:40 server id 201101501  end_log_pos 399760643     Query   thread_id=77556 exec_time=0     error_code=0
SET TIMESTAMP=1337804560/*!*/;
INSERT INTO rnkDOWNLOADS (id_bibrec,id_bibdoc,file_version,file_format,id_user,client_host,download_time) VALUES (1157741,94200,'','GIF;ICON',0,INET_ATON('24.222.171.86'),NOW())
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

Checked whether this data exists on DB slave:

mysql> SELECT * FROM rnkDOWNLOADS WHERE id_bibrec=1157741 AND id_bibdoc=94200 AND client_host=INET_ATON('24.222.171.86');
+-----------+---------------------+-------------+---------+-----------+--------------+-------------+---------+------------------+
| id_bibrec | download_time       | client_host | id_user | id_bibdoc | file_version | file_format | referer | display_position |
+-----------+---------------------+-------------+---------+-----------+--------------+-------------+---------+------------------+
|   1157741 | 2012-05-23 22:22:40 |   417246038 |       0 |     94200 |            0 | GIF;ICON    | NULL    |                0 |
+-----------+---------------------+-------------+---------+-----------+--------------+-------------+---------+------------------+
1 row in set (0.87 sec)

Seems it does.

Indeed, comparing end_log_pos value of 399760643 in zzz.sql file above exactly corresponds to DB slave's position of 399760643:

mysql> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 137.138.198.204
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000063
        Read_Master_Log_Pos: 399760643
             Relay_Log_File: mysqld-relay-bin.000191
              Relay_Log_Pos: 235
      Relay_Master_Log_File: mysql-bin.000063
        Exec_Master_Log_Pos: 399760643

Hence it seems that everything that DB master had in the binary log file before the crash was correctly replicated on the DB slave.

Consequently let's try to start replication anew by forcing the next available position. Not the best, but may work. (See also P.S. below.)

The start of the next DB master binary log file looks like:

(sudo -u mysql mysqlbinlog 'mysql-bin.000064' | head -15)
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#120523 23:54:51 server id 201101501  end_log_pos 98    Start: binlog v 4, server v 5.0.95-log created 120523 23:54:51 at startup
ROLLBACK/*!*/;
# at 98
#120523 23:54:53 server id 201101501  end_log_pos 345   Query   thread_id=5     exec_time=0     error_code=0
use cdsweb/*!*/;
SET TIMESTAMP=1337810093/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1, @@session.unique_checks=1/*!*/;
SET @@session.sql_mode=0/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;

Let's try to restart DB replication from file 64 position 98 by executing (on DB slave):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000064',MASTER_LOG_POS=98;
mysql> START SLAVE;

This restarted the replication.

Since almost a week elapsed since DB master crashed, the DB slave is quite behind:

mysql> SHOW SLAVE STATUS\G;
Seconds_Behind_Master: 560616

But the replication is running again, catching up with the past. Let's see how much time it will take for the slave to catch up.

P.S. For additional safety, one may not rely on such truncated binary log file analysis, but rather restart DB replication from scratch anew, when service time permits.