G73JH - BSOD's (Random & in SWTOR)



clamm
01-02-2012, 09:07 PM
I am hoping the Asus Gods or Chastity :) can provide some guidance. I have a G73JH, under warranty, that BSODs at random and also every time I log into Star Wars: The Old Republic. Some history and information:

Remedies Attempted
1. Updated Bios to 213
2. Updated Vbios to most recent
3. Run Memtest86 - no failures 20 passes
4. Updated all firmware to most recent
5. Updated all drivers to most recent (including ATI 5870)
6. Clean install of Win 7
7. Clean install of all game clients
8. Secondary install on alternate drive (testing to see if it's one HD or the other) - issue continues on either drive

Replicating the Issue
I can guarantee the crash while in SWTOR, but it also happens randomly in other applications (e.g. I was running Event Viewer and got a crash). When it crashes, it's always the same BSOD: stop code 0x124, a hardware error (see below). I have run diagnostics on just about every aspect of the machine and found nothing. Nothing is overclocked; everything is at factory defaults.

The Constants
I have noticed that the GPU temp maxes out at 85C and idles at 48C, yet the card should handle that easily, right? CPU core temps are 46C. The only other constant across all system reloads is that Event Viewer is being spammed with Event 17 from WHEA-Logger (info below). I have researched this and found that other users have had to RMA their machines because of this error alone, with it being related to the ATI card or chipset, but I have no way to verify that.

The Question
Is there anything I have not tried besides standing on my head while rubbing my tummy? Should I just RMA it at this point? Any guidance would be appreciated. Below is a sample of the crash data, and I can provide hwinfo or dxdiag report if you require.

Crash Data
Crash Dump Analysis
--------------------------------------------------------------------------------
On Mon 1/2/2012 3:13:29 PM GMT your computer crashed
crash dump file: C:\Windows\Minidump\010212-11419-01.dmp
This was probably caused by the following module: hal.dll (hal+0x12A3B)
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007352038, 0x0, 0x0)
Error: WHEA_UNCORRECTABLE_ERROR
file path: C:\Windows\system32\hal.dll
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Hardware Abstraction Layer DLL
Bug check description: This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
This is likely to be caused by a hardware problem problem. This problem might be caused by a thermal issue.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system which cannot be identified at this time.


On Mon 1/2/2012 2:48:37 PM GMT your computer crashed
crash dump file: C:\Windows\Minidump\010212-11481-01.dmp
This was probably caused by the following module: hal.dll (hal+0x12A3B)
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007334038, 0x0, 0x0)
Error: WHEA_UNCORRECTABLE_ERROR
file path: C:\Windows\system32\hal.dll
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Hardware Abstraction Layer DLL
Bug check description: This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
This is likely to be caused by a hardware problem problem. This problem might be caused by a thermal issue.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system which cannot be identified at this time.


On Mon 1/2/2012 1:07:25 PM GMT your computer crashed
crash dump file: C:\Windows\Minidump\010212-11294-01.dmp
This was probably caused by the following module: hal.dll (hal+0x12A3B)
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007351038, 0x0, 0x0)
Error: WHEA_UNCORRECTABLE_ERROR
file path: C:\Windows\system32\hal.dll
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Hardware Abstraction Layer DLL
Bug check description: This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
This is likely to be caused by a hardware problem problem. This problem might be caused by a thermal issue.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system which cannot be identified at this time.


On Mon 1/2/2012 12:59:49 PM GMT your computer crashed
crash dump file: C:\Windows\Minidump\010212-10155-01.dmp
This was probably caused by the following module: hal.dll (hal+0x12A3B)
Bugcheck code: 0x124 (0x4, 0xFFFFFA800732D038, 0x0, 0x0)
Error: WHEA_UNCORRECTABLE_ERROR
file path: C:\Windows\system32\hal.dll
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Hardware Abstraction Layer DLL
Bug check description: This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
This is likely to be caused by a hardware problem problem. This problem might be caused by a thermal issue.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system which cannot be identified at this time.


On Mon 1/2/2012 12:59:49 PM GMT your computer crashed
crash dump file: C:\Windows\memory.dmp
This was probably caused by the following module: hal.dll (hal!HalBugCheckSystem+0x1E3)
Bugcheck code: 0x124 (0x4, 0xFFFFFA800732D038, 0x0, 0x0)
Error: WHEA_UNCORRECTABLE_ERROR
file path: C:\Windows\system32\hal.dll
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: Hardware Abstraction Layer DLL
Bug check description: This bug check indicates that a fatal hardware error has occurred. This bug check uses the error data that is provided by the Windows Hardware Error Architecture (WHEA).
This is likely to be caused by a hardware problem problem. This problem might be caused by a thermal issue.
The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system which cannot be identified at this time.
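A quick way to confirm that every dump in a report like the one above carries the same stop code is to tally the "Bugcheck code" lines. The following is a minimal sketch in Python, not part of the original post; the function name `parse_bugchecks` and the regular expression are assumptions made for illustration, and the sample text is copied from the five entries above.

```python
import re
from collections import Counter

# Matches a summary line such as:
#   Bugcheck code: 0x124 (0x4, 0xFFFFFA8007352038, 0x0, 0x0)
BUGCHECK_RE = re.compile(r"Bugcheck code:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_bugchecks(report):
    """Return a list of (code, [param1, ..., param4]) tuples as integers.

    Hypothetical helper: it only understands the line format shown above.
    """
    results = []
    for match in BUGCHECK_RE.finditer(report):
        code = int(match.group(1), 16)
        params = [int(p.strip(), 16) for p in match.group(2).split(",")]
        results.append((code, params))
    return results


SAMPLE = """\
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007352038, 0x0, 0x0)
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007334038, 0x0, 0x0)
Bugcheck code: 0x124 (0x4, 0xFFFFFA8007351038, 0x0, 0x0)
Bugcheck code: 0x124 (0x4, 0xFFFFFA800732D038, 0x0, 0x0)
Bugcheck code: 0x124 (0x4, 0xFFFFFA800732D038, 0x0, 0x0)
"""

crashes = parse_bugchecks(SAMPLE)

# Tally by (stop code, first parameter). The second parameter is an address
# that differs from crash to crash, so it is left out of the key.
tally = Counter((hex(code), hex(params[0])) for code, params in crashes)
print(tally)
```

If the tally has a single key, all crashes share one stop code and one first parameter, which is the pattern reported here.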

Thanks for your assistance, or declaring the inevitable.

fostert
01-02-2012, 09:25 PM
Do you have an overclocked system? In OC circles the BSOD codes mean something (see http://www.xtremesystems.org/forums/showthread.php?266589-The-OverClockers-BSOD-code-list). In the case of the 32nm SB chips, 0x124 usually means too little vcore, and/or QPI/VTT (uncore, or generally the integrated memory controller on the CPU).

I would try resetting the BIOS to defaults, either from the BIOS screen itself or maybe even by unplugging the battery and AC and holding the power button down for ~30 secs. Shot in the dark...

How are the thermals? A high CPU temp can push a marginally stable system over the edge and cause a given voltage to be insufficient. Try cooling the system down as far as possible, e.g. run it in a room with the windows wide open (if it's winter!) and see if your BSODs go away...

clamm
01-02-2012, 09:32 PM
To my knowledge, no, nothing is overclocked. How would I check?

Here is the thermal report:
Sensors @ 02.01.2012 14:55:40 ---------------------------------------------

[System]
Virtual Memory Commited 2187.000 MB
Virtual Memory Available 14043.000 MB
Virtual Memory Load 13.000 %
Physical Memory Used 1810.000 MB
Physical Memory Available 6306.000 MB
Physical Memory Load 22.000 %
[CPU #0]
Core #0 Clock 2921.6 MHz
Core #1 Clock 931.0 MHz
Core #2 Clock 2727.2 MHz
Core #3 Clock 931.0 MHz
Core #0 Thread #0 Usage 51.2 %
Core #0 Thread #1 Usage 0.0 %
Core #1 Thread #0 Usage 6.0 %
Core #1 Thread #1 Usage 0.0 %
Core #2 Thread #0 Usage 13.8 %
Core #2 Thread #1 Usage 0.0 %
Core #3 Thread #0 Usage 4.6 %
Core #3 Thread #1 Usage 0.3 %
Total CPU Usage 9.5 %
On-Demand Clock Modulation 100.0 %
[CPU Digital Thermal Sensor]
CPU#0 Core0 55.0 °C
CPU#0 Core1 50.0 °C
CPU#0 Core2 52.0 °C
CPU#0 Core3 47.0 °C
[ASUS G73 EC]
CPU 2117 RPM
GPU 2851 RPM
[Intel PCH]
PCH Temperature 60.0 °C
CPU Core 31.608 W
[S.M.A.R.T.]
ST9500420AS [5VJ7KKKG] 34.0 °C
ST9500420AS [5VJ7KKKG] Airflow 34.0 °C
[ATI GPU[#0] ATI Mobility Radeon HD 5870 (BROADWAY XT/GL)]
GPU Thermal Diode 56.0 °C
GPU TS0 (DispIO) 49.0 °C
GPU TS1 (MemIO) 57.0 °C
GPU TS2 (Shader) 53.5 °C
GPU Clock 405.0 MHz
GPU Memory Clock 1000.0 MHz
GPU Utilization 0.0 %
GPU Fan Speed 30.000 %
[Battery]
Battery Voltage 16.612 V
Current Capacity 69.135 Wh
Current Capacity 97.318 %
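For anyone who wants to scan a sensor dump like the one above instead of reading it by eye, here is a rough Python sketch that pulls out every °C reading and flags those above a limit. It is an illustration only: the helper name `parse_temps`, the `LIMIT_C` value of 80, and the assumption that every temperature line ends in "°C" are not from the thread, and the sample is a short excerpt of the report.

```python
import re

# A temperature line ends in a number followed by the unit,
# e.g. "CPU#0 Core0 55.0 °C". Lines in other units (W, MHz, %) are ignored.
TEMP_RE = re.compile(r"^(?P<name>.+?)\s+(?P<value>-?\d+(?:\.\d+)?)\s*°C\s*$")

LIMIT_C = 80.0  # hypothetical warning threshold, chosen for illustration


def parse_temps(report):
    """Map 'section / sensor' to a temperature in °C (hypothetical helper)."""
    temps = {}
    section = ""
    for raw in report.splitlines():
        line = raw.strip()
        # Section headers look like "[Intel PCH]".
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        match = TEMP_RE.match(line)
        if match:
            temps[f"{section} / {match.group('name')}"] = float(match.group("value"))
    return temps


SAMPLE = """\
[CPU Digital Thermal Sensor]
CPU#0 Core0 55.0 °C
[Intel PCH]
PCH Temperature 60.0 °C
CPU Core 31.608 W
[ATI GPU[#0] ATI Mobility Radeon HD 5870 (BROADWAY XT/GL)]
GPU Thermal Diode 56.0 °C
"""

temps = parse_temps(SAMPLE)
too_hot = sorted(name for name, value in temps.items() if value > LIMIT_C)

for name, value in temps.items():
    print(f"{value:6.1f} °C  {name}")
print("Above limit:", too_hot or "none")
```

A single snapshot taken at idle says little about the temperature at the moment of a crash, so a check like this is more useful on a log captured under load.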

fostert
01-02-2012, 09:39 PM
Temps look normal. Reset the BIOS and everything will be stock, JIC something was changed.

I have found that memtest86+ does not always reveal memory errors that, say, prime95 (or some other stress test) does. It might be wise to test each stick individually with memtest. A half-hour run of the prime95 blend test might reveal an error too.

clamm
01-02-2012, 10:50 PM
Does this Prime 95 really take months? When I click status, it says it will be complete on February 28!

And, re: Bios, I flashed it back to 211, same issues continue.

clamm
01-02-2012, 11:17 PM
Prime 95 Blend Results after :30 =

[Mon Jan 02 17:13:06 2012]
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!
Self-test 640K passed!

fostert
01-03-2012, 12:34 AM
Well, if prime95 passes continuously then I'd say your memory can be ruled out! Interesting too that the BIOS didn't fix it. Temps are all good as well. I don't think it can be related to the CPU/memory subsystem hardware, as memtest and prime95 are designed to be as tough on hardware as possible and to reveal even the most subtle instability in the system. I think we've narrowed it to either the graphics hardware or the software.

Do you have any ASUS apps on there? Some of them are known to cause hangs. Try this: boot into windows and run the task mangler. Manually stop *every* process on there not related to windows; e.g. your nvidia drivers, ASUS ATK stuff (Hcontrol), touchpad software, USB3.0 driver, audio drivers, office software protection platform, virus scan, torrent software, and anything else you have starting. Just leave Windows barebones running with no goodies. Can you replicate the error then?

Could also boot into safe mode and see what happens, but I presume you couldn't run your game from there.

clamm
01-03-2012, 12:56 AM
So I ran it bare bones, still crashed. It seems like the gpu fan slowly increases speed (as temps rise to 82-83c) and then it crashes to BSOD. I don't want to declare video card issue, but seems logical?

And no, I have one to two Asus apps on after the reinstall, but none are running.

fostert
01-03-2012, 01:10 AM
Yeah, I agree it's probably the GPU. I have heard on these forums that low-80s is normal and fine for the GPU, but those numbers are for the nvidia cards in G74s. I have also heard here that mobile ATIs run hotter and are less stable than the nvidias. Nonetheless the ATI should be able to handle a load that puts it at ~80C for a while.

Did you try the "cold room" test?
Can you vacuum out the vents of this thing?

If you find that keeping the system cooled below room temperature eliminates your problem, then you certainly should RMA it. Or if you want to keep it then you could do a repaste, but you might only gain ~5C with that.

Running out of ideas...

clamm
01-03-2012, 01:19 AM
I did the cold room test, which was easy because it's 52F outside, lol. I have already blown out the vents. One question: when the fans are running, shouldn't I be feeling heat coming out of both sides of the rear exhaust vents? Only the right side is blowing out as the GPU heats up. Not sure if that is anything, but I am grasping at straws at this point. If I merely use the machine to surf, or play WOW, life is good. It's really weird.

Also, I killed every process and it still dumped, same error. Thanks for helping me eliminate what we could, not sure what's next beyond RMA'ing it.

fostert
01-03-2012, 01:51 AM
52F? That's not cold! It's 7F here in Brandon! Got my G74 compiling by an open window right now, and the CPU temp is 20C!

Dunno what the fan & vent configuration for the G73 is (am a G74 owner), or where the heat is supposed to go. What if you stress the CPU with prime95: does the other fan on the left side become active? If not, maybe you have a dead fan.

Still should test the memory sticks individually before you give up.

If they are all good, and since you cooled it off and observed no improvement, it's, unfortunately, RMA time. I think you've tried everything that I would have too, and beyond. At least you can tell them this and that it points squarely to the GPU.

clamm
01-03-2012, 02:20 AM
Any directions for testing them individually? Never done that before.

fostert
01-03-2012, 02:34 AM
Put in a 1x4GB module into slot 1 and run memtest86+ for 3 passes. Do the same for the others.
Or if the computer won't POST with only one module in, put in two, test, switch out one of them, test again, until you've covered all possible pairs. If any errors show up you'll be able to nail the faulty stick.
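The pair-testing fallback above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stick labels "A", "B", "C" are made up, the failing pairs are invented example input rather than results from this machine, and the deduction assumes exactly one stick is faulty.

```python
from itertools import combinations


def pair_schedule(sticks):
    """Every two-stick combination that needs a memtest86+ run."""
    return list(combinations(sticks, 2))


def suspects(sticks, failing_pairs):
    """Sticks present in every failing pair.

    Assumes exactly one faulty stick; with no failures there is no suspect.
    """
    if not failing_pairs:
        return []
    candidates = set(sticks)
    for pair in failing_pairs:
        candidates &= set(pair)
    return sorted(candidates)


sticks = ["A", "B", "C"]          # hypothetical labels for the modules
schedule = pair_schedule(sticks)  # runs to perform, in order
print(schedule)

# Invented example: errors appeared whenever stick B was installed.
failing = [("A", "B"), ("B", "C")]
print(suspects(sticks, failing))
```

The stick common to every failing run is the one to replace; if two sticks remain as candidates, one more run with either of them and a known-good stick separates them.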

dstrakele
01-03-2012, 05:43 AM
In http://www.sevenforums.com/crashes-debugging/119012-asus-g51j-bsod.html, a G51 owner was encountering similar BSODs (a STOP 0x124 where the 1st parameter is 0x4, indicating that an uncorrectable PCI Express error occurred). There is some indication this issue was resolved by removing the current NVIDIA driver and installing a later version from Safe Mode.
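To illustrate how the first parameter of a STOP 0x124 is read, here is a small Python lookup. It is a sketch, not a complete table: only the two values I am sure of are listed (0x4, as stated above, and 0x0 for a machine check exception), and the function name `describe_whea` is made up for this example.

```python
WHEA_UNCORRECTABLE_ERROR = 0x124

# Partial mapping of bug check parameter 1 to the reported error source.
# Deliberately incomplete: other values exist but are not listed here.
WHEA_ERROR_SOURCES = {
    0x0: "machine check exception",
    0x4: "uncorrectable PCI Express error",
}


def describe_whea(code, param1):
    """Describe the error source for a 0x124 bug check (hypothetical helper)."""
    if code != WHEA_UNCORRECTABLE_ERROR:
        return "not a WHEA_UNCORRECTABLE_ERROR bug check"
    return WHEA_ERROR_SOURCES.get(param1, "error source not listed in this sketch")


# Values taken from the crash data earlier in the thread.
print(describe_whea(0x124, 0x4))
```

A first parameter of 0x4 points at the PCI Express side of the system, which is consistent with suspecting the graphics card or its driver rather than RAM.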

I would try this as a last effort to avoid an RMA. Go to http://rog.asus.com/forum/showthread.php?5517-Latest-NVIDIA-drivers-Official-Beta-and-Modded&p=39896&viewfull=1#post39896 and pick a recent NVIDIA driver (I'd go for 285.74 or 285.62 WHQL, but I won't dissuade you from trying one of the later beta versions - I run the 290.53 beta). Download it and extract it (if necessary), then boot into Safe Mode and choose the "Custom Install" method, being sure to check the "Clean Install" checkbox.

Booting into Safe Mode will help by reducing the number of running drivers that may conflict with a driver installation. Choosing "Clean Install" will uninstall your current NVIDIA driver, hopefully removing any corrupted remnants. If this fails to resolve your issue, you at least can RMA knowing you tried your best to resolve the issue.

clamm
01-04-2012, 12:21 AM
Ds, I have an ATI card (Mobility Radeon HD 5870). I would imagine that would not work with an Nvidia driver?

dstrakele
01-04-2012, 12:51 AM
Definitely NOT! However, you could attempt to install a more recent ATI driver.

clamm
01-04-2012, 03:07 AM
Ok gang, good news to report here. I read on a forum where someone said to install the 12.1 preview drivers from ATI. I have done so and have been crash free while pushing the GPU with the same tests that were causing crashes previously. I have no idea what the install did, but it eliminated all instances of the WHEA-Logger Event 17 error in Event Viewer, and now the GPU pushes to 97C (at which point I stopped) with no crashes.

One last question: the fan speed never goes over 30% even when pushed to 97C. Do I need to overclock this or leave it alone?