
When is an overclocked system considered as "running stable"

pixelwerk
Level 7
I've overclocked my Threadripper 1950X system to 4.0 GHz @ 1.351 V and the system is running "solid", with maximum temperatures of 75°C under full load for several hours.

Even so, it can happen that I end up with a bluescreen when rendering for several hours at 100% CPU load. Today it was a bluescreen after 8 hours, but I have also had the system rendering for 50 hours at 100% CPU load without a crash.

Is this something I have to live with on an overclocked system, or is it already a sign that the overclock is beyond a stable point?

Korth
Level 14
My work machines are "mission critical", configured to sustain three or four "nines" of uptime and built with as much "server grade" reliability and redundancy as possible. These machines use ASUS boards, Titanium PSUs, enterprise-class SSDs, quality UPS backups, and a robust isolation transformer, with ECC memory and self-powered data caches wherever they're supported. I consider overclocks "stable" on these machines when they can run brutal stress/torture tests for a full week (168 hours), uninterrupted and unsupervised, in a nasty "worst case" 30°C to 35°C ambient. (I do check their status a few times each day, just to confirm they haven't crashed or caught fire and to avoid wasting more time on already-failed test burns, but I don't really have to. Data corruption is absolutely intolerable; data integrity is the highest priority. Downtime is somewhat intolerable; lost uptime equals forced downtime, which I'd rather spend doing useful work than troubleshooting/tinkering with a failed platform. Yes, it means the overclocks are more marginal/modest and less reckless/spectacular.)
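A long, unsupervised burn like that is easier to check on if something is logging alongside it. Below is a minimal sketch of such a logger in Python using psutil; the file name, sample interval, and availability of temperature sensors are assumptions for illustration, not details from the post.

# burn_log.py -- minimal sketch: record CPU load and the hottest reported
# temperature during a long stress run, so a quick check-in only requires
# reading the last few lines of the log.
# Assumes Python 3 with psutil installed; temperature sensors are only
# exposed on platforms psutil supports (e.g. Linux) and may be absent.
import csv
import time
from datetime import datetime

import psutil

LOG_PATH = "burn_log.csv"   # hypothetical output file, overwritten each run
INTERVAL_S = 60             # sample once per minute

def max_core_temp():
    """Return the hottest reported temperature in °C, or None if unsupported."""
    read = getattr(psutil, "sensors_temperatures", None)
    if read is None:
        return None
    temps = [t.current for entries in read().values() for t in entries]
    return max(temps) if temps else None

def main():
    with open(LOG_PATH, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent", "max_temp_c"])
        while True:
            # cpu_percent(interval=...) blocks for the interval and returns
            # the average load over it, so this loop samples once per minute.
            load = psutil.cpu_percent(interval=INTERVAL_S)
            writer.writerow([datetime.now().isoformat(timespec="seconds"),
                             load, max_core_temp()])
            f.flush()  # keep the log readable even if the machine locks up

if __name__ == "__main__":
    main()

If the machine crashes overnight, the timestamp of the last flushed row also gives a rough idea of when it went down.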

My overclocked gaming machines only need to stress for a few hours; I usually just run them overnight. The logic is that I'm only going to actually demand full performance across the entire duration of a long gaming session (a few hours at a time, at most). A BSoD/crash/lock/reboot or other fault which appears only rarely and only becomes evident after days of relentless punishment is of little concern, especially since whatever data is on the machine isn't "critical" and it can all be restored/rebooted to working order quickly enough. So I punch the overclocks up towards the maximums the machines can sustain, turning them down a notch only if crashes/etc. become annoying. An important part of this "maximum overclock" approach is rigid adherence to a solid backup strategy.
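As a rough illustration of that last point (not the setup described in the post), a pre-session backup can be as small as the following Python sketch; the source and destination paths are hypothetical placeholders.

# pre_session_backup.py -- minimal sketch of a "back up before pushing the
# overclock" habit: mirror a few important directories to an external drive.
# All paths below are assumed placeholders; adjust them to your own layout.
import shutil
from datetime import date
from pathlib import Path

SOURCES = [Path.home() / "Documents" / "projects",   # assumed directory
           Path.home() / "saves"]                    # assumed directory
DEST_ROOT = Path("E:/backups") / date.today().isoformat()  # assumed external drive

def main():
    for src in SOURCES:
        if not src.exists():
            print(f"skipping missing directory: {src}")
            continue
        dest = DEST_ROOT / src.name
        # dirs_exist_ok lets the same day's backup be re-run safely (Python 3.8+)
        shutil.copytree(src, dest, dirs_exist_ok=True)
        print(f"copied {src} -> {dest}")

if __name__ == "__main__":
    main()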

So it's a personal judgement call. From what I've read online, most gamers/overclockers seem to consider one hour or a few hours sufficient; even ten to twenty minutes is enough to weed out the overwhelming majority of unstable overclocks. It's also not as simple as measuring a definite "pass/fail" threshold: each part has a blurry zone where you have to decide how far up the curve of performance vs temps/volts/stability/longevity you want to go, and the more you shift the balance towards the parameter(s) on one side, the more you sacrifice the parameter(s) on the opposite side. Again, it's a personal judgement call.
"All opinions are not equal. Some are a very great deal more robust, sophisticated and well supported in logic and argument than others." - Douglas Adams

[/Korth]

I'm usually running multi-Xeon machines here, so this is my first system that I've overclocked, and I have no reference values, which is why I was asking. Thanks for your input. The system is planned for everyday usage and not server tasks. For rendering tasks I'll still rely on my in-house renderfarm, so I don't think it'll run at full load for several hours. The tests I ran were just for stress-testing the system and tweaking the fans. The extra overclocked MHz come in quite handy in single-threaded tasks, which is one of the reasons I decided to go for a Threadripper system instead of another dual-Xeon system or maybe a dual-EPYC system.

So do you think I can keep the system at this clock speed, or should I rather set it back to stock speed?

JustinThyme
Level 13
Crash=not stable

Now let's talk real world. Are you using it with a 100% load for several hours? What are you using to create that load for several hours? Please don't tell me Prime95... Prime95 is a CPU space heater and in no way reflects any real-world scenario. If you want to use it for short durations to test your cooling capabilities, great. Other than that, use something like 3DMark, Heaven, RealBench, etc.
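For the short cooling checks mentioned above, a disposable all-core load can also be generated without any particular benchmark; the sketch below is one way to do it in Python, with the duration being an arbitrary placeholder.

# timed_load.py -- minimal sketch: keep every logical core busy for a fixed,
# short duration, purely to watch temperatures and fan behaviour. This is a
# cooling check, not a stability or correctness test.
import multiprocessing as mp
import time

DURATION_S = 600  # ten minutes, roughly long enough to reach steady-state temps

def burn(deadline):
    """Busy-loop on one core until the shared deadline passes."""
    x = 1.0001
    while time.monotonic() < deadline:
        x = (x * x) % 1_000_003  # pointless math just to keep the core busy

def main():
    deadline = time.monotonic() + DURATION_S
    workers = [mp.Process(target=burn, args=(deadline,))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("done -- check peak temperatures in your monitoring tool of choice")

if __name__ == "__main__":
    main()

One process per logical core is used (rather than threads) so that Python's GIL doesn't limit the load to a single core.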



“Two things are infinite: the universe and human stupidity, I'm not sure about the former” ~ Albert Einstein

JustinThyme wrote:
Crash=not stable

Now let's talk real world. Are you using it with a 100% load for several hours? What are you using to create that load for several hours? Please don't tell me Prime95... Prime95 is a CPU space heater and in no way reflects any real-world scenario. If you want to use it for short durations to test your cooling capabilities, great. Other than that, use something like 3DMark, Heaven, RealBench, etc.


Rendering animations and high-res stills with V-Ray, which is the purpose of the workstation. I was also running some synthetic benchmarks, but most of the testing I did was real-world scenarios.

pixelwerk wrote:
Rendering animations and high-res stills with V-Ray, which is the purpose of the workstation. I was also running some synthetic benchmarks, but most of the testing I did was real-world scenarios.


Which real world scenarios?



“Two things are infinite: the universe and human stupidity, I'm not sure about the former” ~ Albert Einstein

JustinThyme wrote:
Which real world scenarios?

Real-world scenario for me means the things I'm using on a daily basis: simulating, raytracing, compositing and video editing.

pixelwerk wrote:
Real-world scenario for me means the things I'm using on a daily basis: simulating, raytracing, compositing and video editing.


And that loads your CPU to 100% for hours on end?



“Two things are infinite: the universe and human stupidity, I'm not sure about the former” ~ Albert Einstein

Korth
Level 14
Back up your data frequently and run your rendering tasks ... if the machines (or their overclocks) prove unstable, then upgrade (or turn things down a notch) and repeat.

There's no need to obsess over exact failure thresholds, only a need to demonstrate that normal usage doesn't exceed them. There's little sense in spending more time testing the machines than it would take to recover from a failure.
"All opinions are not equal. Some are a very great deal more robust, sophisticated and well supported in logic and argument than others." - Douglas Adams

[/Korth]

My backup is pretty bulletproof, with an external RAID and even an LTO tape drive backup, and no projects are stored locally on the workstation, so even if the workstation literally burned to the ground, nothing would be lost... data-wise.

Thanks a lot for your feedback