Starting Heartbeat
1. Starting Heartbeat on the Primary Node
Once Heartbeat is installed, a startup script named heartbeat is created automatically under /etc/init.d. Running /etc/init.d/heartbeat with no arguments prints its usage, as shown below:
[root@node1 ~]# /etc/init.d/heartbeat
Usage: /etc/init.d/heartbeat {start|stop|status|restart|reload|force-reload}
Heartbeat can therefore be started with:
[root@node1 ~]#service heartbeat start
or with:
[root@node1 ~]#/etc/init.d/heartbeat start
This starts the heartbeat service on the primary node. To have heartbeat start automatically at boot and stop cleanly at shutdown, create the following symbolic links by hand:
[root@node1 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc0.d/K05heartbeat
[root@node1 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc3.d/S75heartbeat
[root@node1 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc5.d/S75heartbeat
[root@node1 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc6.d/K05heartbeat
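On Red Hat-style systems, the same runlevel links can be managed with chkconfig instead of being created by hand; this is a sketch, assuming the init script carries the usual chkconfig header:

```shell
# Register the init script and enable it for runlevels 3 and 5
# (equivalent to the S75/K05 links created above)
chkconfig --add heartbeat
chkconfig --level 35 heartbeat on
chkconfig --list heartbeat   # confirm the on/off state per runlevel
```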
While Heartbeat starts up, watch the primary node's system log with "tail -f /var/log/messages". The output looks like this:
[root@node1 ~]# tail -f /var/log/messages
Nov 26 07:52:21 node1 heartbeat: [3688]: info: Configuration validated. Starting heartbeat 2.0.8
Nov 26 07:52:21 node1 heartbeat: [3689]: info: heartbeat: version 2.0.8
Nov 26 07:52:21 node1 heartbeat: [3689]: info: Heartbeat generation: 3
Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_TriggerHandler: Added signal manual handler
Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
Nov 26 07:52:21 node1 heartbeat: [3689]: info: glib: ping heartbeat started.
Nov 26 07:52:21 node1 heartbeat: [3689]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Nov 26 07:52:21 node1 heartbeat: [3689]: info: Local status now set to: 'up'
Nov 26 07:52:22 node1 heartbeat: [3689]: info: Link node1:eth1 up.
Nov 26 07:52:23 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 up.
Nov 26 07:52:23 node1 heartbeat: [3689]: info: Status update for node 192.168.60.1: status ping
This part of the log shows Heartbeat performing its initial configuration: the heartbeat interval, the UDP broadcast port, the status of the ping node, and so on. The log then pauses here and resumes 120 seconds later; those 120 seconds are exactly the value of the "initdead" option in ha.cf. Heartbeat then prints the following:
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: node node2: is dead
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Comm_now_up(): updating status to active
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Local status now set to: 'active'
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Starting child client "/usr/lib/heartbeat/ipfail" (694,694)
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: No STONITH device configured.
Nov 26 07:54:22 node1 heartbeat: [3689]: WARN: Shared disks are not protected.
Nov 26 07:54:22 node1 heartbeat: [3689]: info: Resources being acquired from node2.
Nov 26 07:54:22 node1 heartbeat: [3712]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 694 gid 694 (pid 3712)
In this part of the log, node2 has not yet been started, so Heartbeat issues the warning "node2: is dead". It then launches the heartbeat plugin ipfail, and because STONITH was not configured in ha.cf, the log also warns "No STONITH device configured".
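When combing through these logs, it helps to filter out the informational noise and show only warnings and errors. A minimal sketch (the ha_warnings function name is ours; the log format is the one shown above):

```shell
# Print only WARN/ERROR lines emitted by heartbeat.
# Defaults to the system log; pass another file to check a saved copy.
ha_warnings() {
    grep -E 'heartbeat.*(WARN|ERROR)' "${1:-/var/log/messages}"
}
```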
The log continues:
Nov 26 07:54:23 node1 harc[3713]: info: Running /etc/ha.d/rc.d/status status
Nov 26 07:54:23 node1 mach_down[3735]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Nov 26 07:54:23 node1 mach_down[3735]: info: mach_down takeover complete for node node2.
Nov 26 07:54:23 node1 heartbeat: [3689]: info: mach_down takeover complete.
Nov 26 07:54:23 node1 heartbeat: [3689]: info: Initial resource acquisition complete (mach_down)
Nov 26 07:54:24 node1 IPaddr[3768]: INFO: Resource is stopped
Nov 26 07:54:24 node1 heartbeat: [3714]: info: Local Resource acquisition completed.
Nov 26 07:54:24 node1 harc[3815]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Nov 26 07:54:24 node1 ip-request-resp[3815]: received ip-request-resp 192.168.60.200/24/eth0 OK yes
Nov 26 07:54:24 node1 ResourceManager[3830]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 07:54:24 node1 IPaddr[3854]: INFO: Resource is stopped
Nov 26 07:54:25 node1 ResourceManager[3830]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 07:54:25 node1 IPaddr[3932]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
Nov 26 07:54:25 node1 IPaddr[3932]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
Nov 26 07:54:25 node1 IPaddr[3932]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
Nov 26 07:54:25 node1 avahi-daemon[1854]: Registering new address record for 192.168.60.200 on eth0.
Nov 26 07:54:25 node1 IPaddr[3932]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
Nov 26 07:54:26 node1 IPaddr[3911]: INFO: Success
Nov 26 07:54:26 node1 Filesystem[4021]: INFO: Resource is stopped
Nov 26 07:54:26 node1 ResourceManager[3830]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
Nov 26 07:54:26 node1 Filesystem[4062]: INFO: Running start for /dev/sdb5 on /webdata
Nov 26 07:54:26 node1 kernel: kjournald starting. Commit interval 5 seconds
Nov 26 07:54:26 node1 kernel: EXT3 FS on sdb5, internal journal
Nov 26 07:54:26 node1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Nov 26 07:54:26 node1 Filesystem[4059]: INFO: Success
Nov 26 07:54:33 node1 heartbeat: [3689]: info: Local Resource acquisition completed. (none)
Nov 26 07:54:33 node1 heartbeat: [3689]: info: local resource transition completed
This part of the log covers resource monitoring and takeover, carrying out what was configured in the haresources file: here, bringing up the cluster's virtual IP and mounting the shared disk partition.
At this point, running ifconfig on the primary node shows that it has automatically bound the cluster IP address, and pinging the cluster IP 192.168.60.200 from a host outside the HA cluster succeeds; in other words, the address is now live.
Checking the mounted filesystems likewise shows that the shared partition /dev/sdb5 has been mounted automatically.
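The checks described above can be run as follows; the interface alias, mount point, and VIP match the configuration used in this chapter:

```shell
# On node1: the cluster IP should be bound to the eth0:0 alias
ifconfig eth0:0

# On node1: /dev/sdb5 should be mounted on /webdata
df -h /webdata

# From a host outside the cluster: the service IP should answer
ping -c 3 192.168.60.200
```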
2. Starting Heartbeat on the Backup Node
Start Heartbeat on the backup node in the same way as on the primary node:
[root@node2 ~]#/etc/init.d/heartbeat start
or run:
[root@node2 ~]#service heartbeat start
This starts the heartbeat service on the backup node. As before, to have heartbeat start at boot and stop at shutdown, create the following symbolic links:
[root@node2 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc0.d/K05heartbeat
[root@node2 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc3.d/S75heartbeat
[root@node2 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc5.d/S75heartbeat
[root@node2 ~]#ln -s /etc/init.d/heartbeat /etc/rc.d/rc6.d/K05heartbeat
The backup node's heartbeat log output mirrors the primary node's. "tail -f /var/log/messages" shows output such as:
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Link node1:eth1 up.
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Status update for node node1: status active
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Link node1:eth0 up.
Nov 26 07:57:15 node2 harc[2123]: info: Running /etc/ha.d/rc.d/status status
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Comm_now_up(): updating status to active
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Local status now set to: 'active'
Nov 26 07:57:15 node2 heartbeat: [2110]: info: Starting child client "/usr/lib/heartbeat/ipfail" (694,694)
Nov 26 07:57:15 node2 heartbeat: [2110]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 70 ms (> 50 ms) (GSource: 0x8f62080)
Nov 26 07:57:15 node2 heartbeat: [2134]: info: Starting "/usr/lib/heartbeat/ipfail" as uid 694 gid 694 (pid 2134)
The backup node detects that node1 is active and that there are no resources for it to take over, so it simply starts the network-monitoring plugin ipfail to watch the primary node's heartbeat.
Testing Heartbeat
Testing is the best way to find out whether an HA cluster actually works. Before putting a Heartbeat high-availability cluster into production, run the following five tests to confirm that the HA setup behaves correctly.
1. Cleanly Stopping and Restarting Heartbeat on the Primary Node
First, run "service heartbeat stop" on node1 to shut down the primary node's Heartbeat process cleanly. Running ifconfig on the primary node then shows that it has released the cluster service IP address, and the shared disk partition has been unmounted as well. On the backup node, meanwhile, the cluster service IP has been taken over and the shared disk partition mounted automatically.
Pinging the cluster service IP throughout this process shows that it remains reachable the whole time, with no delay or interruption: when the primary node is shut down cleanly, the switch from primary to backup is seamless, and the service the HA cluster provides continues without a break.
Next, start heartbeat on the primary node again. Once it is up, the backup node automatically releases the cluster service IP and unmounts the shared partition, while the primary node once again takes over the cluster service IP and mounts the shared partition; in fact, the backup node releases the resources at the same time as the primary node binds them. This transition, too, is therefore seamless.
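The first test can be sketched loosely as the following command sequence; run the ping from a third host so it is not affected by the failover itself:

```shell
# Step 1, on node1: stop heartbeat cleanly and confirm the VIP is released
service heartbeat stop
ifconfig eth0:0                 # alias should no longer carry 192.168.60.200

# Step 2, on node2: confirm the takeover
ifconfig eth0:0                 # should now show 192.168.60.200
df -h /webdata                  # /dev/sdb5 should be mounted here

# Step 3, on node1: restart heartbeat; resources fail back (auto_failback on)
service heartbeat start
```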
2. Unplugging the Primary Node's Network Cable
After the cable connecting the primary node to the public network is unplugged, the heartbeat plugin ipfail detects the failure at once through its ping tests and releases the node's resources. At the same moment, the ipfail plugin on the backup node also detects the primary node's network failure; as soon as the primary node has finished releasing its resources, the backup node takes them over, so the network service keeps running without interruption.
Likewise, when the primary node's network recovers, the cluster resources automatically fail back from the backup node to the primary node, because the "auto_failback on" option is set.
After the cable is pulled on the primary node, the log reads as follows; note in particular the lines recording the release of the resources:
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link node2:eth0 dead.
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 dead.
Nov 26 09:04:09 node1 ipfail: [3712]: info: Status update: Node 192.168.60.1 now has status dead
Nov 26 09:04:09 node1 harc[4279]: info: Running /etc/ha.d/rc.d/status status
Nov 26 09:04:10 node1 ipfail: [3712]: info: NS: We are dead. :<
Nov 26 09:04:10 node1 ipfail: [3712]: info: Link Status update: Link node2/eth0 now has status dead
…… (middle of the log omitted) ……
Nov 26 09:04:20 node1 heartbeat: [3689]: info: node1 wants to go standby [all]
Nov 26 09:04:20 node1 heartbeat: [3689]: info: standby: node2 can take our all resources
Nov 26 09:04:20 node1 heartbeat: [4295]: info: give up all HA resources (standby).
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Releasing resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 stop
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Running stop for /dev/sdb5 on /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Trying to unmount /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: unmounted /webdata successfully
Nov 26 09:04:21 node1 Filesystem[4340]: INFO: Success
Nov 26 09:04:22 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 stop
Nov 26 09:04:22 node1 IPaddr[4428]: INFO: /sbin/ifconfig eth0:0 192.168.60.200 down
Nov 26 09:04:22 node1 avahi-daemon[1854]: Withdrawing address record for 192.168.60.200 on eth0.
Nov 26 09:04:22 node1 IPaddr[4407]: INFO: Success
The backup node logs the following while taking over the primary node's resources:
Nov 26 09:02:58 node2 heartbeat: [2110]: info: Link node1:eth0 dead.
Nov 26 09:02:58 node2 ipfail: [2134]: info: Link Status update: Link node1/eth0 now has status dead
Nov 26 09:02:59 node2 ipfail: [2134]: info: Asking other side for ping node count.
Nov 26 09:02:59 node2 ipfail: [2134]: info: Checking remote count of ping nodes.
Nov 26 09:03:02 node2 ipfail: [2134]: info: Telling other node that we have more visible ping nodes.
Nov 26 09:03:09 node2 heartbeat: [2110]: info: node1 wants to go standby [all]
Nov 26 09:03:10 node2 heartbeat: [2110]: info: standby: acquire [all] resources from node1
Nov 26 09:03:10 node2 heartbeat: [2281]: info: acquire all HA resources (standby).
Nov 26 09:03:10 node2 ResourceManager[2291]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:03:10 node2 IPaddr[2315]: INFO: Resource is stopped
Nov 26 09:03:11 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
Nov 26 09:03:11 node2 IPaddr[2393]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
Nov 26 09:03:12 node2 avahi-daemon[1844]: Registering new address record for 192.168.60.200 on eth0.
Nov 26 09:03:12 node2 IPaddr[2393]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
Nov 26 09:03:12 node2 IPaddr[2372]: INFO: Success
Nov 26 09:03:12 node2 Filesystem[2482]: INFO: Resource is stopped
Nov 26 09:03:12 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
Nov 26 09:03:13 node2 Filesystem[2523]: INFO: Running start for /dev/sdb5 on /webdata
Nov 26 09:03:13 node2 kernel: kjournald starting. Commit interval 5 seconds
Nov 26 09:03:13 node2 kernel: EXT3 FS on sdb5, internal journal
Nov 26 09:03:13 node2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Nov 26 09:03:13 node2 Filesystem[2520]: INFO: Success
3. Pulling the Primary Node's Power Cord
When the primary node's power cord is pulled, the backup node's heartbeat process immediately receives notice that the primary node has shut down. If a STONITH device is configured for the cluster, the backup node powers off or resets the primary node through it, and only once the STONITH device has completed all of its operations does the backup node gain ownership of the primary node's resources and take them over.
After the primary node loses power, the backup node logs output similar to the following:
Nov 26 09:24:54 node2 heartbeat: [2110]: info: Received shutdown notice from 'node1'.
Nov 26 09:24:54 node2 heartbeat: [2110]: info: Resources being acquired from node1.
Nov 26 09:24:54 node2 heartbeat: [2712]: info: acquire local HA resources (standby).
Nov 26 09:24:55 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:24:57 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
4. Cutting All of the Primary Node's Network Connections
When the heartbeat link on the primary node is disconnected, both nodes log "eth1 dead", but this alone does not trigger a resource switch. If the cable connecting the primary node to the public network is then unplugged as well, a failover does take place and the resources move from the primary node to the backup node. If the heartbeat cable is then reconnected, the system log shows the backup node's heartbeat process restarting and taking control of the cluster resources once more. Finally, when the primary node's public network cable is reconnected, the cluster resources move back from the backup node to the primary node. That is the complete switchover sequence.
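While running these cable-pull tests, a simple loop on a third host makes the exact moment of each switchover visible; the VIP below is the one used in this chapter:

```shell
# Print the VIP's reachability once per second, with a timestamp
while true; do
    if ping -c 1 -W 1 192.168.60.200 >/dev/null 2>&1; then
        echo "$(date '+%H:%M:%S') 192.168.60.200 up"
    else
        echo "$(date '+%H:%M:%S') 192.168.60.200 DOWN"
    fi
    sleep 1
done
```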
5. Killing the heartbeat Daemon on the Primary Node
Kill the heartbeat process on the primary node with "killall -9 heartbeat". Because heartbeat is terminated abnormally, the resources it controls are never released; after going a short while without a response from the primary node, the backup node concludes that the primary has failed and takes over its resources. The result is resource contention: both nodes hold the same resources at once, leading to data corruption. This can be addressed with watchdog, the kernel monitoring module provided by Linux. With watchdog integrated into heartbeat, if heartbeat terminates abnormally or the system fails, watchdog automatically reboots the machine, which releases the cluster resources and prevents the data conflict.
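Were watchdog integrated, the setup would amount to loading the softdog kernel module and pointing heartbeat at the watchdog device in ha.cf; a sketch:

```shell
# Load the software watchdog module (add it to the boot sequence to persist)
modprobe softdog

# Then add the following directive to /etc/ha.d/ha.cf:
#   watchdog /dev/watchdog
# and restart heartbeat for it to take effect
service heartbeat restart
```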
We have not configured watchdog in this chapter's cluster. If watchdog is configured, running "killall -9 heartbeat" produces a message like the following in /var/log/messages:
Softdog: WDT device closed unexpectedly. WDT will not stop!
This message tells us that the system has detected a problem and will reboot.