Rampage IV Extreme RAID 5 failure

DeJuanNOnley
Level 7
I recently set up a 3-drive RAID 5 array (brand-new WD Caviar Black 2 TB drives) on my Rampage IV Extreme mobo (not the Black Edition), and it fails on every boot, showing that the 1st and 3rd drives are not part of the array. I've tried re-creating it to no avail. I boot to a stand-alone SSD on the ASMedia controller. Once in Windows 8.1 I can launch the Intel RSTe software and force the array back to normal, then initialize and verify (which takes a few days to complete, of course), and everything passes.

I notice that, at least as of the latest BIOS, the controller defaults to RSTe mode for RAID, but some people have suggested iRST may be preferable, though I have yet to see a good argument as to why. The ASUS website, however, only offers the driver and software for iRST mode, despite the motherboard defaulting to RSTe, so I downloaded the driver and software from Intel. Is it possible that changing to iRST would fix this problem? Has anyone ever run into something similar?

Specs:
ASUS Rampage IV Extreme
Intel Core i7-3930K CPU
8 x 4 GB G.Skill DDR3-2400 RAM

Edit: After waiting a couple of days for a response I went ahead and tried changing to iRST mode (after backing up my data, of course). I then booted into Safe Mode and forcibly installed the drivers from the ASUS website (Intel C600/C220 series chipset SATA RAID), then deleted and re-created the array from the BIOS setup (not the software) using Ctrl-I. It showed up as "normal" in POST for the first 2-3 boots, but then it again shows as failed, with the same two drives not being members of the array.

Interestingly, I can no longer install the management software from either the ASUS or Intel websites. Both result in installation errors saying my system does not meet the requirements (in RSTe mode I was able to install the software from Intel, but not from ASUS). I can still RUN the Intel management software from the zip file on the ASUS website and do the same as I did in RSTe mode (force to normal, initialize, verify).

Could both of these brand-new drives be bad? Sure, it's possible. But when I force the state to normal, initialize, and then verify, it always comes back with zero errors and everything works just fine. And none of the drives ever shows as failed; it is only the array itself that shows as failed, with a generic message about the first and third drives not being array members. 20+ years in IT and I've never seen anything like this, though to be fair I've mainly used Adaptec controllers.

Nodens
Level 16
Have these drives been members of arrays on other controllers? This sounds like you have leftover metadata on the drives that "confuses" the Intel controller. Since you're an IT guy, the solution I'm going to suggest should be easy to pull off: boot a live Linux CD with dmraid support and run "dmraid -r -E /dev/sda", where /dev/sda is the drive you need to clear of leftover metadata.
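In case it's useful, here is a minimal sketch of that procedure, assuming the live CD ships dmraid and that the three array members happen to enumerate as /dev/sda through /dev/sdc (check with lsblk or "dmraid -r" before erasing anything):

```bash
# List drives that still carry RAID metadata, then erase it per member.
# Device names below are examples only; confirm yours first.
dmraid -r                  # show block devices with leftover RAID metadata
dmraid -r -E /dev/sda      # erase metadata on the first member (asks for confirmation)
dmraid -r -E /dev/sdb
dmraid -r -E /dev/sdc
```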
RAMPAGE Windows 8/7 UEFI Installation Guide - Patched OROM for TRIM in RAID - Patched UEFI GOP Updater Tool - ASUS OEM License Restorer
There are 10 types of people in the world. Those who understand binary and those who don't!

RealBench Developer.

DeJuanNOnley
Level 7
Good suggestion, but it didn't work. I didn't think it would, since they are new drives, but I tried it with my GParted Live CD and the same problem occurred (I even deleted the array first, then recreated it afterwards). I wiped the metadata from all three drives, even the one that never shows as a non-member. I do have some more details on the issue that I've noticed.

On the first few boots everything works normally: I create partitions, copy data, reboot, etc., and even during POST the Intel controller shows the array status as "normal." It's only after I install the monitoring software in Windows 8 that my problems start. (As I mentioned, it refuses to install at all in iRST mode, but since RSTe is the default for the chipset anyway I'm sticking with that; it makes no difference as far as this failure goes.) Since ASUS doesn't provide the RSTe drivers (even though that's the default mode) I had to get them from Intel. I'm wondering if downgrading to an earlier version might help.

But it gets even weirder. On the next boot after installing the Intel RSTe management software, the Intel controller shows the array as "failed" and the first and third drives as non-members. If I allow the system to boot in this state (to my SSD on the ASMedia controller), the array is offline and I have to fire up the software and force the state back to normal. BUT if I hit Ctrl-I and enter setup during POST, it shows the array state as NORMAL, even though just a second earlier it said otherwise! If I simply exit at that point and allow the system to continue to boot, everything is fine.

I've seen similar things with other RAID controllers, and there is usually an option to delay the initial boot process to allow the drives to spin up properly. However, I don't see that with the Intel controller.

Edit: I'm going to try disabling Quick Boot to see if this helps. I don't like Quick Boot anyway; it's just the default for this mobo, and it has never caused an issue on this system before. It does seem odd, though, that every time this has happened it has been right after running the Intel management software in Windows for the first time on the array. Coincidence? Maybe...

Edit 2: The first couple of boots went well. If I'm right, this would also explain why it seems like every time I mess with the overclock settings it boots up fine the next time: that introduces a delay into the next boot. Can't believe I didn't think of this earlier. Sometimes the simplest explanations...

Nodens
Level 16
OK, let's see. What you describe sounds like the drives momentarily fail and the controller marks them as such. I will list several things that can cause this, apart from drivers, which I will cover at the end. You should check all of them.

1) Faulty SATA cables. If you have not tested with different cables do so.

2) Not enough juice from the PSU on boot. Mechanical drives need as much as 35-40 W each during the spin-up stage (depending on the drive); three drives spinning up together can momentarily pull an extra ~120 W, roughly 10 A, from the 12 V rail. This is why the "staggered spin-up" option exists on real hardware controllers. It's not so the drives can spin up properly, it's so they don't all spin up at the same time and cause a huge spike on the PSU. This matters more in data centers, of course, due to the number of drives involved, but it can still happen on normal systems with multiple drives if the PSU is going bad or simply is not up to that load. Overclocked systems with modern graphics cards do tend to pull a lot from the PSU; add a few mechanical drives to such a system running on a PSU close to its limit and you've got yourself a problem. The 12 V rail is what feeds the spindle motor, and that's the easiest one to starve.

3) Are you connecting the drives via case hotbays? If so, try connecting them directly to the controller, removing the hotbays from the equation. 90% of the case hotbays I've seen (even on expensive cases like the HAF-X I currently use for my dev system) are faulty, causing frequent disconnects on the SATA channels. With the LSI cards I use this is easy to diagnose, as the disconnects show up in the logs. With onboard fakeraid it's impossible unless you bypass the hotbays. On the other hand, most fakeraids won't complain about this problem and it will just manifest as a performance drop, unless it's severe, in which case the symptoms are like what you describe. A quick way to check for cable or hotbay trouble is sketched right below.
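This is only a sketch, assuming smartmontools is available on that same live CD and that the members enumerate as /dev/sda through /dev/sdc; a rising UDMA CRC error count almost always points at the cable or backplane connection rather than the drive itself:

```bash
# SMART attribute 199 (UDMA_CRC_Error_Count) increments on link-level errors,
# which are usually cabling/backplane problems, not platter problems.
for d in /dev/sda /dev/sdb /dev/sdc; do    # example device names; adjust to yours
    echo "== $d =="
    smartctl -A "$d" | grep -i -E 'crc|command_timeout'
done
```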

Now regarding RSTe and RST:
1) There are some versions of the RSTe drivers that are quite problematic, causing issues that range from dropouts to stop errors (I have diagnosed some of the latter myself here from memory dumps provided by users). Unfortunately I cannot recall the version numbers, as it's been quite a while.
2) Plain RST won't install while the board is loading the RSTe Option ROM unless you use a driver with a modified .inf (hardware signature added), which requires Windows running in "Test Mode" or disabling Driver Signature Enforcement on every boot. Or you can use one of my patched UEFI versions (link in signature) to load the RST Option ROM. Some of the UEFI versions support loading the RST UEFI driver, but you have to install Windows in UEFI mode and use the UEFI driver instead of the Option ROM, via the CSM settings in the UEFI or by disabling the CSM module entirely. The latest ones pack the RST Option ROM as well, but you have to select it in the UEFI prior to Windows installation. So what you do in this department depends on which UEFI version you are on / want to use. Use this information to make your choice :)

Regarding FastBoot:
This is entirely irrelevant. All FastBoot does is skip initializing certain components during POST (some of which are configurable in the UEFI) so that the system boots faster. The disk subsystem is certainly not one of them, and while you're on RAID the initialization of member drives is handled by the OpROM or the UEFI driver. In other words, it's entirely outside FastBoot's domain.

Regarding your last edit:
System instability can drop disks out of arrays, so make sure your system is stable. Also, BCLK overclocking can be detrimental to your data, as it can cause corruption very, very easily. Do not overclock BCLK unless you know exactly what you're doing and/or have no critical data.
RAMPAGE Windows 8/7 UEFI Installation Guide - Patched OROM for TRIM in RAID - Patched UEFI GOP Updater Tool - ASUS OEM License Restorer
There are 10 types of people in the world. Those who understand binary and those who don't!

RealBench Developer.

DeJuanNOnley
Level 7
All great suggestions. Unfortunately, most of them don't apply.

1) Already tried other SATA cables.

2) Possible but unlikely. This is a Corsair 1200W active PFC 80 Plus Platinum PSU. If that isn't enough to power three hard drives, one SSD and a single video card I would be surprised.

3) No. The drives are connected directly to the SATA ports on the motherboard.

I have to agree on the FastBoot thing; it didn't seem likely to me either. As I said, I've never used onboard RAID before. And I am running stock BCLK; I am only overclocking via the multiplier and VCORE voltage. Is it really possible that's an issue? I have 100% stability as confirmed by multiple stress tests, but out of curiosity I'm willing to go back to stock settings just to see if that is the problem.

So... from what you are saying about RSTe, it sounds like maybe rolling back to an earlier version of the driver MAY solve the issue?

Frankly I'm to the point of just ordering an Adaptec controller and being done with it.

Nodens
Level 16
Yes, trying a different driver version is a good idea. Unfortunately I can't recall the problematic driver version numbers, but a search on this very forum will turn up some threads about them. Also, yes, do try it at stock just to rule out instability. Unfortunately, stress tests mostly prove that you can pass said stress test; they're not 100% representative of stability. RealBench's stress test comes close, as it loads every available subsystem, in contrast to running FFTs ad infinitum like most other stress tests do. Still, the best stress test is actual usage of your system for its intended purpose. If it works properly, it works 😛

That said, Intel RAID, like all onboard solutions, is actually BIOS-assisted software RAID. This means there's an underlying hardware or driver issue at play here. I would go as far as to suggest actually putting the drives under Windows or Linux software RAID just to rule out the drives, assuming you can reproduce this at stock clocks and different driver versions don't help.
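If you go the Linux route, here is a rough sketch with mdadm (device names are again examples, and this wipes whatever is on the drives, so only after backing up):

```bash
# Build a plain Linux md RAID 5 from the same three drives to see whether any
# of them drops out under load once the Intel fakeraid is out of the picture.
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0             # any filesystem will do for a test
cat /proc/mdstat               # follow the initial sync; re-run to watch for drops
mdadm --detail /dev/md0        # per-member state once it settles
```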
RAMPAGE Windows 8/7 UEFI Installation Guide - Patched OROM for TRIM in RAID - Patched UEFI GOP Updater Tool - ASUS OEM License Restorer
There are 10 types of people in the world. Those who understand binary and those who don't!

RealBench Developer.

DeJuanNOnley
Level 7
Thanks for all your help, but I've now rebooted dozens of times in a row (warm and cold) and the array shows as Normal 100% of the time with Fast Boot disabled. With it enabled it shows up as failed, unless I change some unrelated BIOS setting first, which slows down the next boot. And even when it shows as failed, if I immediately hit Ctrl-I to enter the RAID setup it then shows as normal without my changing anything. It definitely has something to do with the drives spinning up. I've seen this a handful of times with hardware RAID controllers, though it has been a few years; it just didn't dawn on me that I could have the same problem until I noticed the pattern of when it failed and when it didn't. It's a very rare issue, but it can happen, especially since these drives are not designed for RAID.

I'm curious about checking the status of TLER, and whether it is enabled (it is on some Caviar drives) on the one drive that always passes. I believe that even with the newest Black series of drives you can manually enable it with the WDTLER utility, which I know I downloaded somewhere a long time ago... In any case, if anyone else encounters this issue, disabling Fast Boot may help. Though of course it is also best to use drives designed for RAID, like the WD Red series. I just had these lying around that Western Digital sent me as RMA replacements.
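For reference, rather than digging up WDTLER, the drive's error-recovery setting (SCT ERC, i.e. TLER) can be read and set from a live Linux CD with smartmontools. This is only a sketch, the device name is an example, and on many desktop Blacks the drive simply reports it as unsupported:

```bash
smartctl -l scterc /dev/sdb          # read the current read/write recovery timeouts
smartctl -l scterc,70,70 /dev/sdb    # try to set both to 7.0 seconds (TLER-style);
                                     # the setting usually resets on power cycle
```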

Nodens
Level 16
a) If there's any issue with spin-up and the PSU can handle the spike load (your Corsair certainly can), then there's either an issue with the drive spindle motors or with the PSU cables. There's no other possibility, always assuming the issue can be reproduced at stock clocks and the RAM is not faulty.

b) TLER is entirely irrelevant here. TLER only matters on hardware controllers: software RAID, and thus fakeraid as well, will not drop a drive for taking too long over its internal error recovery the way hardware controllers do. But even if it did, it's still unrelated to this, as error recovery only pertains to read/write errors; it has nothing to do with spindle motors or POST. For it to have any effect, there must be a read or write operation pending.

c) Caviar Black drives from the FAEX line onwards cannot have the TLER bit toggled; the WDTLER utility does not work on them, only on the old Black drives. The only current WD drives that support TLER are the Enterprise versions and the Reds. I would steer clear of the Red line though, as there's a reason they carry a 3-year warranty compared to the 5 years of the Enterprise and Black drives. If you want cheap TLER-enabled drives, do what I do and buy SATA2 Enterprise drives. SATA2 vs SATA3 on mechanical drives only affects the speed of the cache, which with hardware controllers (with onboard cache) you should disable anyhow 😉

Bottom line, though: TLER only matters with hardware controllers.
RAMPAGE Windows 8/7 UEFI Installation Guide - Patched OROM for TRIM in RAID - Patched UEFI GOP Updater Tool - ASUS OEM License Restorer
There are 10 types of people in the world. Those who understand binary and those who don't!

RealBench Developer.