[Alder Lake] RAM geometry detected incorrectly #425
-
Hi, I have a CLEVO PD70PNN1 laptop with 32 GB of DDR5 RAM (2 x 16 GB), but CoreFreq reports only a single 16GB DIMM in the DIMM geometry. However, the output of … If it matters, I'm running Linux Mint 21. Is there any other information I can provide to help?
Replies: 37 comments
-
Hello, first CoreFreq report with … Please post here the full output of …
-
Hi! Here is …
-
Hello, can you pull and try the new changes in the master branch, where the IMC Bus is now computed from the BIOS PLL ratio? The resulting frequency has to be static. Reading this Intel post, I believe that the Gear mode should be taken into account, but I'm not sure which registers are involved. I will need a refresh of corefreq-cli -M. Thank you
-
Btw, generations 11 up to Raptor Lake can now experiment with SA voltage monitoring. You have to build this way: …
In the UI, switch to the …
-
Here's … Seems to detect all 32 GB correctly now. Thanks!
-
It works, but not the way I was expecting it to! The bus rate has decreased to 1900 MHz, with DDR5 at about 2400 MHz. Does that sound right to you? Another addition for your platform is …
-
EDIT: Sorry, I meant to say IBT (indirect branch tracking), not BTI. The kernel option is … Original post below. Yes, the geometry seems weird. As for the bus rate, I'll need to check what the BIOS reports, and I'll post back later today. I have to admit I'm not that familiar with processor internals. What is TCO in this context? I didn't find anything useful on the internet due to the overloaded acronym - even Intel themselves only talk about total cost of ownership. I'll look in the BIOS for a TCO setting. Speaking of processor features, one detail I forgot to mention - I have … Also, for what it's worth, this is a new system, and there are still some other random issues I need to debug:
These seem unlikely to be CPU-related, so just for information :) I'll get back to you later today after I've looked at the BIOS settings. |
-
This is the definition of TCO, and the datasheet I'm using to program the registers.
-
BIOS reports the DRAM frequency as 4800 MHz. (I suppose this includes the "double" in DDR.) And after booting Linux again, with nothing changed, the memory bus is now at almost 4 GHz?!
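(For my own sanity, the arithmetic I'm assuming here - the standard DDR naming convention, where the advertised figure is transfers per second, not the clock:)

    data rate (MT/s) = 2 x bus clock (MHz)
    DDR5-4800  =>  4800 / 2 = 2400 MHz actual bus clock

So the BIOS's "4800 MHz" corresponds to a 2400 MHz bus clock, which matches neither the 1900 MHz nor the ~4000 MHz reading directly.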
Definition and datasheet - thank you, very interesting! Summarizing: in this context, TCO is a low-level system crash watchdog, and the acronym indeed means total cost of ownership. No wonder it was hard to find. :) The BIOS on this machine has no settings for TCO - or indeed that many settings at all. It is … Here's a list of available BIOS settings, other than TPM setup:
I'm not seeing any kernel modules with "TCO" in the name. Here is the full output of …
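(In case it helps anyone searching later: I believe the mainline Linux driver for Intel's TCO watchdog is iTCO_wdt - the leading "i" is why a plain "TCO" module search can miss it. A quick check:)

    lsmod | grep -i -e tco -e wdt
    modinfo iTCO_wdt | head -n 3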
-
Ah, and I tested enabling TCO in CoreFreq, in Window ⊳ Technologies. Here's …
-
Cause of DRAM frequency discrepancy possibly found. After booting the system, the memory bus is at 4000 MHz. But if I suspend and resume, it drops to 1900 MHz. Next, to figure out why this happens... By the way, thanks for developing CoreFreq, which makes this kind of detailed analysis possible! I originally installed CoreFreq to be able to monitor individual core clock rates and temperatures, for performance profile tuning. :) |
-
Thanks for your feedback. Line 4544 in 1cd8f35
-
Ok, good to know, thanks. The weird thing is, once I suspend/resume, the reported bus rate stays around 1900 MHz all the way until the next boot - it doesn't fluctuate. Likewise, the 4000 MHz after a fresh boot remains stable until I suspend/resume the machine; it doesn't fluctuate either. It also doesn't matter whether the machine is stressed or not, so if the reading is accurate, whatever is happening doesn't seem to be a dynamic performance scaling issue. I read about Alder Lake and XMP 3.0, but this BIOS has no settings for that either, so all I have to go on regarding the memory bus rate is what CoreFreq tells me. Or what other tools tell me - for comparison, I tried CPU-X, which says "Kingston KF548S38-16, 16384 MB @ 4800 MHz (SODIMM DDR5)" for each of the two RAM slots (this is after a suspend and resume). I've run some test loads (AI training on GPU using custom code built on TensorFlow, and an MPI-distributed FEM code on CPU, specifically a custom Navier-Stokes solver built on FEniCS). Performance for both seems normal after suspend/resume, but to be sure, I'll have to re-check with a fresh boot.
-
What is not trivial to read from the source code is that CSR registers like …
-
To read IMC data and up to the third group of timings: HWiNFO and OCCT in the Windows world. Memtest86+, running bare-metal, may provide the primary group of timings on Alder Lake, and also the IMC frequency.
-
Regarding gear mode, yeah, it should be taken into account. Also, it doesn't seem to be well documented. Out of curiosity, I took a look at the guide, and gear is not mentioned anywhere in any of the 5060 pages of the document, despite it being up to date up to 13th gen. :) I noticed that Tom's Hardware mentioned that a tool called … EDIT: Ah, right, you mentioned HWiNFO already :)
-
You should rather read the datasheets of the 13th generation, where gear is somewhat specified. See my wiki for doc references.
-
Ah, thank you! According to vol. 2 of the Raptor Lake datasheet, section Scheduler Configuration (SC_GS_CFG_0_0_0_MCHBAR): if bit 31 is set, the MC is in GEAR2, and if bit 15 is set, the MC is in GEAR4:
On a side note, OS access to this register is R (read-only) - only the BIOS has RW access. Which is probably a good thing. :P For comparison, I also checked vol. 2 of the Alder Lake datasheet, same section (pp. 137-139), and the offset and the bit numbers are the same. So we should be able to read the gear the same way on both gen 12 and gen 13. Do you want to have a go at implementing this (I can test it), or alternatively, care to point me at the relevant part of the source code so I can try? EDIT: added a note that this register is used for scheduler configuration, as per the docs.
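To spell out the decoding, a minimal sketch based only on the two bit positions quoted above (the function name and the GEAR1 fallback for "neither bit set" are my assumptions, not from the datasheet):

    /* Decode the DDR gear mode from a raw SC_GS_CFG_0_0_0_MCHBAR value.
     * Per the datasheet excerpt above: bit 31 set => GEAR2, bit 15 set => GEAR4.
     * Assumption: neither bit set => GEAR1. */
    static unsigned int Decode_Gear(unsigned int sc_gs_cfg)
    {
        if (sc_gs_cfg & (1U << 31))
            return 2;
        if (sc_gs_cfg & (1U << 15))
            return 4;
        return 1;
    }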
-
Hmm, there's also a 32-bit MMIO register at Memory Controller BIOS Request (MC_BIOS_REQ_0_0_0_MCHBAR_PCU) — Offset 5E00h (pp. 202-203 in Raptor Lake docs, pp. 184-185 in Alder Lake docs):
And then there's a 32-bit MMIO register at Memory Controller BIOS Data (MC_BIOS_DATA_0_0_0_MCHBAR_PCU) — Offset 5E04h (pp. 203-204; respectively pp. 186-187):
OS has R access to both of these registers, too. Right now, I don't know which of these is best - maybe try all of them and see what they report? |
-
@Technologicat You will peek the register in Query_ADL_IMC():

    void Query_ADL_IMC(void __iomem *mchmap, unsigned short mc)
    {   /* Source: 12th Generation Intel® Core Processor Datasheet Vol 2 */
        unsigned int value;

        value = readl(mchmap + 0x5E04); /* MC_BIOS_DATA_0_0_0_MCHBAR_PCU */
        printk("Register=%x\n", value);
    }

Next, rebuild, reload the driver, and print the kernel log to read the register output in hexadecimal:

    make clean all
    rmmod corefreqk
    insmod ./corefreqk.ko
    dmesg

Expected output along the lines of:

    Register=abcd1234
-
I'll also be away for a few days, so a short update for now:
Also, while strictly unrelated: I've babbled so much about my setup in this thread that other users with a CLEVO … (More related to RAM, in general, is the random fact that the VRAM on the GPU runs at a 7 GHz clock rate. I hadn't realized GDDR was that fast.) So I'll report that I tried underclocking the GPU (cores as well as VRAM) by 10%. This did it - it seems the crashes are gone. My hypothesis is that the crashing is likely caused by the infamous transient power spikes of the RTX 30xx GPU series, briefly overwhelming the power supply capabilities of the laptop. The power brick is 200W, but there is also the battery subsystem to consider. (I have no idea whether the power always passes through the battery subsystem on this model.) Note the GPU TDP is 125W, and for the i7-12700H CPU, 45W; the rest of the system also needs some power. So if the GPU draws much more power even for a short while, the system may brown out and crash. Also note that the crashes occur even with the GPU sustained power draw limited to 80W (…). So if you ended up here from a search engine, and are experiencing similar issues on a similar laptop, here's what I did:
As a concluding side note, another solution for troubleshooting GPU power issues, seen on the internet, is:
This disables performance scaling of the GPU, leading to a more predictable power draw. Note that in this mode, the GPU will consume more power when idle. You can also change the setting in the GUI, as well as see its current value, by running just … Also note that changing the PowerMizer mode has mostly been suggested for "the GPU has fallen off the bus" errors, not for random system crashes. I tried on both …
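(For concreteness, a hedged command-line version of the same tweak, assuming the proprietary driver's nvidia-settings tool and GPU index 0; mode 1 is "Prefer Maximum Performance":)

    # Set PowerMizer to "Prefer Maximum Performance" on the first GPU
    nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"
    # Query the current value
    nvidia-settings -q "[gpu:0]/GpuPowerMizerMode"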
-
Ok, I'm back. Here's the raw data from registers 0x5E00 and 0x5E04 (sampled both out of curiosity):
Or in binary: …
As for … In unrelated news, I spoke too soon - the crashes weren't gone yet, they just occurred much less often. No crash in up to 2h of gaming, then boom. It's still very rare to get the system to crash with anything other than Cyberpunk, but the fact it has happened twice with other loads tells me it's not the game. But because the crash occurs most often with it, this specific game is an ideal test load. Trying the Unigine Valley GPU benchmark, I noticed that when power-limited to 80W, the GPU clock rate jumps around a lot. Reading this comment on undervolting got me thinking, and I recalled … Indeed, locking the maximum GPU clock rate at 1.785 GHz * (80W / 125W) ~ 1.1 GHz (where 125W is the default TDP and 1.785 GHz the default GPU clock rate), the GPU at full load stays just under the power limit and does not need to throttle the clock rate, according to monitoring via nvtop. No need to touch the VRAM clock rate; it can run at the default 7 GHz. Tested yesterday: two hours of idling in-game, and one hour of gaming (later, in another session). So far stable. (And as a note for other gamers who happen upon this: 80W / 1.1 GHz on the RTX 3070 Ti mobile gives about 30-40 FPS, which to me is perfectly playable in a role-playing game. The much lower fan noise level is well worth the FPS hit. Your mileage may vary.)
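(For anyone wanting to replicate: a hedged example using the proprietary driver's nvidia-smi clock-lock feature. The 1100 MHz upper bound is just the scaled figure computed above; the 210 MHz lower bound is an assumption, pick whatever your card idles at:)

    # Lock the GPU core clock range (requires root)
    sudo nvidia-smi -lgc 210,1100
    # Undo the lock
    sudo nvidia-smi -rgc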
-
Perhaps, if not null (aka register value not equal …).
If this can help, CoreFreq is monitoring some hardware event bits. |
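(If I read the suggestion right, a minimal sketch of that guard, continuing the Query_ADL_IMC() snippet from earlier; treating zero as the "null" value is my assumption:)

    value = readl(mchmap + 0x5E04);
    if (value != 0) {   /* assumption: 0 means no valid data latched */
        /* trust and decode this sample */
    }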
-
One important thing I forgot to mention: upon loading the module, CoreFreq read the IMC registers 8 times. Only the first read gives any useful results. On second and further reads, the value in both registers is …
Ah, thank you! It's very nice that CoreFreq runs in a terminal, so I can SSH in from another machine and run it over the SSH session while a fullscreen benchmark is running :) Also, for NVIDIA GPU tuning, I forgot to mention that … From the … (I don't know why the extra "1". My first thoughts were that either someone at NVIDIA likes 2001: A Space Odyssey, or has set things up well in advance for a reference to the old Over 9000! internet meme, once future VRAM gens hit 9 GHz. But more likely it's due to rounding.)
-
Is this on the same channel, same controller? Because the driver loops over 8 possible controllers, 12 channels each, to accommodate the latest Zen architectures. Line 2000 in 34efe5d Different controllers and different channels will certainly return the same timings, but registers may have slightly changed. So it appears better to probe all of them when feasible; see the sketch below.
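(The shape of that probing loop, as a minimal sketch - the constants match the description above, but the names and the validity check are my own, not the actual CoreFreq identifiers:)

    #define MC_MAX  8   /* possible memory controllers */
    #define CHA_MAX 12  /* channels per controller */

    static void Probe_All_IMC(void)
    {
        unsigned short mc, cha;
        for (mc = 0; mc < MC_MAX; mc++) {
            for (cha = 0; cha < CHA_MAX; cha++) {
                /* Read_IMC_Register() is a hypothetical helper */
                unsigned int value = Read_IMC_Register(mc, cha);
                if (value != 0) {   /* skip null reads, per the discussion above */
                    /* decode geometry / timings for this channel */
                }
            }
        }
    }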
-
So forget my GFX events; those are for the Intel integrated GPU (iGPU).
-
Thanks, good catch. Printing the value of the …
so once per controller only, as expected.
Ah, right, good point. I have that disabled for now. |
-
Original issue covered?
-
Yes, working correctly. Thank you! P.S. Noticed that …
-
P.P.S. Final note for other gamers: again, the crashes were not yet gone, they just occurred less often. The real culprit turned out to be the CPU - most likely a compatibility issue either with the Linux kernel or with some specific software (such as Cyberpunk). I disabled the e-cores, and haven't had a single crash since then. What made me test this possibility was reading anecdotal reports on the internet about game instability with e-cores, and about PC crashes during video transcoding with e-cores enabled. Also, I got a semi-reproducible crash by merging Stable Diffusion checkpoints, which made this easier to test. When the crash happened, the hw monitors (…). Note that the BIOS setting for legacy game compatibility mode does nothing in Linux; instead, use the features provided by Linux to turn off individual cores. An easy GUI way is to create a profile in TUXEDO Control Center and, in that profile, set the number of logical cores to 12. Then the system will use the p-cores and their hyperthreads, as can be confirmed by … (see also the sketch below).
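For completeness, a hedged command-line equivalent of disabling the e-cores (assuming, per the 12-logical-core profile above, that logical CPUs 12-19 are the e-cores on this i7-12700H; numbering can vary by machine):

    # Take the e-cores offline via sysfs (revert by echoing 1)
    for n in $(seq 12 19); do
        echo 0 | sudo tee /sys/devices/system/cpu/cpu$n/online
    done
    # Verify which logical CPUs remain online
    cat /sys/devices/system/cpu/online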