Author Topic: overclocking

vh

  • formerly mudkipz
  • *****
  • Posts: 1140
  • "giving heat meaning"
overclocking
« on: April 28, 2018, 11:48:26 AM »
it's been a while since we've had a new topic in this forum

so i have an i7 7700k, and for a few months i'd gone with the default motherboard OC settings, which were supposed to set the clock to 4.6 GHz (the default is 4.2).

however, when running a simple benchmark, i found that 4.6 only kicked in when i loaded all 8 threads; with only 1 thread, the cpu frequency stayed stuck at 4.2

single thread speed: 580
all threads speed: 3824

in fact the default settings (no motherboard "OC") did even better on the single-thread test (though slightly worse on all threads):

single thread speed: 615
all threads speed: 3662

it only clocks up to 4.4, but it does so consistently for both single-threaded and multi-threaded loads

seeing as i wanted higher clock speeds for single-threaded workloads, whereas a slightly lower frequency is fine for heavier workloads (where thermals are the limit anyway), i did some custom tweaking: with only 1 thread active the cpu runs at 5.0 GHz, and with all 8 threads loaded it runs at 4.6

single thread speed: 685
all threads speed: 3830

this is strictly better than both the default and the default "OC" settings -- roughly an 18% gain in single-threaded performance over the motherboard "OC" profile, and about 11% over stock! single-threaded performance, let's face it, is where most of the bottlenecks in my workflow are
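
for reference, here's a minimal sketch of how to check which turbo bin actually engages under load. this is not the benchmark the scores above came from; it just spins on N threads and dumps the per-core frequency the kernel reports:
Code:
/* busyclock.c -- hypothetical sketch, not the benchmark scored above.
 * spin on N threads, then print the per-core frequencies the kernel reports,
 * so you can see which turbo bin engages.  build: gcc -O2 -pthread busyclock.c */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;   /* volatile keeps the busy loop from being optimized away */
    for (;;) x++;
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
    for (int i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, spin, NULL);
    }
    sleep(2);                                 /* give the frequency governor time to settle */
    system("grep 'cpu MHz' /proc/cpuinfo");   /* current clock of each logical cpu */
    return 0;
}

run it once with 1 and once with 8 and compare the reported MHz.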

vh

  • formerly mudkipz
  • *****
  • Posts: 1140
  • "giving heat meaning"
Re: overclocking
« Reply #1 on: May 01, 2018, 07:06:07 AM »
ok so i didn't exactly overclock my memory but i found this neat benchmark for it:
https://github.com/ssvb/tinymembench/

Here are my results...
Code:
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  11470.2 MB/s (2.5%)
 C copy backwards (32 byte blocks)                    :  11484.5 MB/s (0.5%)
 C copy backwards (64 byte blocks)                    :  11477.2 MB/s (0.5%)
 C copy                                               :  12119.4 MB/s (0.6%)
 C copy prefetched (32 bytes step)                    :  11934.4 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :  11945.7 MB/s (0.3%)
 C 2-pass copy                                        :  10096.3 MB/s (0.3%)
 C 2-pass copy prefetched (32 bytes step)             :   9899.5 MB/s (0.2%)
 C 2-pass copy prefetched (64 bytes step)             :   9914.8 MB/s (0.2%)
 C fill                                               :  19430.9 MB/s (5.1%)
 C fill (shuffle within 16 byte blocks)               :  19648.5 MB/s (5.1%)
 C fill (shuffle within 32 byte blocks)               :  19568.6 MB/s (2.6%)
 C fill (shuffle within 64 byte blocks)               :  19707.4 MB/s (2.6%)
 ---
 standard memcpy                                      :  17635.5 MB/s (1.2%)
 standard memset                                      :  39296.3 MB/s (0.2%)
 ---
 MOVSB copy                                           :  13553.6 MB/s (0.3%)
 MOVSD copy                                           :  13571.2 MB/s (0.3%)
 SSE2 copy                                            :  12429.3 MB/s (0.4%)
 SSE2 nontemporal copy                                :  18494.1 MB/s (0.8%)
 SSE2 copy prefetched (32 bytes step)                 :  12137.9 MB/s (0.3%)
 SSE2 copy prefetched (64 bytes step)                 :  12206.1 MB/s (0.4%)
 SSE2 nontemporal copy prefetched (32 bytes step)     :  17166.9 MB/s (0.2%)
 SSE2 nontemporal copy prefetched (64 bytes step)     :  17097.9 MB/s (0.2%)
 SSE2 2-pass copy                                     :  10945.6 MB/s (0.4%)
 SSE2 2-pass copy prefetched (32 bytes step)          :  10485.5 MB/s (0.2%)
 SSE2 2-pass copy prefetched (64 bytes step)          :  10532.4 MB/s (0.2%)
 SSE2 2-pass nontemporal copy                         :   8355.2 MB/s (0.4%)
 SSE2 fill                                            :  19732.5 MB/s (2.8%)
 SSE2 nontemporal fill                                :  46347.8 MB/s (0.3%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.7 ns          /     1.0 ns
    131072 :    1.1 ns          /     1.3 ns
    262144 :    1.3 ns          /     1.4 ns
    524288 :    5.1 ns          /     6.8 ns
   1048576 :    7.1 ns          /     8.5 ns
   2097152 :    8.0 ns          /     9.0 ns
   4194304 :    8.6 ns          /     9.2 ns
   8388608 :   11.4 ns          /    12.9 ns
  16777216 :   33.6 ns          /    45.3 ns
  33554432 :   45.4 ns          /    55.4 ns
  67108864 :   52.5 ns          /    59.8 ns

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    0.7 ns          /     1.0 ns
    131072 :    1.1 ns          /     1.3 ns
    262144 :    1.3 ns          /     1.4 ns
    524288 :    4.2 ns          /     5.7 ns
   1048576 :    5.7 ns          /     6.9 ns
   2097152 :    6.4 ns          /     7.3 ns
   4194304 :    6.8 ns          /     7.4 ns
   8388608 :    7.7 ns          /     8.4 ns
  16777216 :   29.1 ns          /    40.7 ns
  33554432 :   39.3 ns          /    49.7 ns
  67108864 :   44.4 ns          /    52.5 ns

some notes, mainly about things i didn't know
1. apparently there's a prefetch intrinsic you can use in gcc/g++ (__builtin_prefetch). that's news to me (quick sketch after these notes)
https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Other-Builtins.html#Other-Builtins

2. apparently software prefetch doesn't help much here, probably because the access pattern is sequential and the hardware prefetcher already predicts it

3. there is support for nontemporal memory shenanigans, which are significantly faster than the normal (temporal) operations for these big copies and fills (also shown in the sketch below)
relevant SO post: https://stackoverflow.com/questions/37070/what-is-the-meaning-of-non-temporal-memory-accesses-in-x86
in a nutshell: loads/stores that bypass the cache so they don't pollute it with data you'll never touch again, at the cost of weaker ordering guarantees

4. random reads in buffers of 32K and below show essentially zero extra latency (the benchmark reports time on top of the L1 latency)
relevant SO post: https://stackoverflow.com/q/4087280/1858363
according to the tables in that link, the 131K and 262K timings look like L2 hits, the jump between 8M and 16M marks the edge of the L3 (so the L3 is somewhere in the 8M to 16M range), and 64M+ is clearly going out to DRAM. 32K and below seems to fit in the L1 cache
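
for notes 1 and 3, here's a hedged sketch of what those two things look like in code. this isn't tinymembench's implementation, just a minimal illustration of __builtin_prefetch plus the SSE2 non-temporal store _mm_stream_si128:
Code:
/* nt_copy.c -- illustrative only, not tinymembench's code.
 * copies 16-byte-aligned buffers whose size is a multiple of 64 bytes,
 * prefetching ahead and writing with streaming (non-temporal) stores
 * that bypass the cache.  build: gcc -O2 -msse2 nt_copy.c */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void nt_copy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i += 4) {
        __builtin_prefetch(s + i + 16);   /* hint: pull in data ~4 cache lines ahead */
        _mm_stream_si128(d + i + 0, _mm_load_si128(s + i + 0));
        _mm_stream_si128(d + i + 1, _mm_load_si128(s + i + 1));
        _mm_stream_si128(d + i + 2, _mm_load_si128(s + i + 2));
        _mm_stream_si128(d + i + 3, _mm_load_si128(s + i + 3));
    }
    _mm_sfence();   /* streaming stores are weakly ordered; fence before the data is used elsewhere */
}

int main(void)
{
    size_t n = 1 << 20;                        /* 1 MiB, a multiple of 64 bytes */
    unsigned char *src = aligned_alloc(16, n);
    unsigned char *dst = aligned_alloc(16, n);
    memset(src, 0xab, n);
    nt_copy(dst, src, n);
    printf("copies match: %s\n", memcmp(src, dst, n) == 0 ? "yes" : "no");
    free(src);
    free(dst);
    return 0;
}

the _mm_sfence() at the end is the "weaker ordering" part from note 3: without it, stores issued after the copy could become visible to other cores before the streamed data does.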

hmm, so let's try to look up the size of the L1/L2/L3 cache on the i7 7700k with lscpu..
Code:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Stepping:            9
CPU MHz:             4697.944
CPU max MHz:         4900.0000
CPU min MHz:         800.0000
BogoMIPS:            8400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

more precisely, we want these 4 lines:
Code:
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K

it's almost exactly as predicted by the benchmark timings: i guessed the L1 topped out at 32K, which is exactly the L1d size; i put the L2 at the 262144-byte buffer, which is exactly 256K; and i put the L3 somewhere between 8M and 16M, and it's 8192K.
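
if you'd rather cross-check those numbers from code, glibc exposes the same cache sizes through sysconf (the _SC_LEVEL* names are a glibc extension, so this is linux-specific):
Code:
/* cachesizes.c -- print the cache sizes glibc reports, for comparison with lscpu.
 * build: gcc -O2 cachesizes.c */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1d: %ldK\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L1i: %ldK\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
    printf("L2:  %ldK\n", sysconf(_SC_LEVEL2_CACHE_SIZE)  / 1024);
    printf("L3:  %ldK\n", sysconf(_SC_LEVEL3_CACHE_SIZE)  / 1024);
    return 0;
}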

Cesare

  • *****
  • Posts: 656
  • Universe Sandbox 2 is my favourite simulator.
    • Cesare Vesdani
Re: overclocking
« Reply #2 on: May 10, 2018, 01:46:36 PM »
You should get yourself a 16-core processor, then you would not need to overclock your CPU.

vh

  • formerly mudkipz
  • *****
  • Posts: 1140
  • "giving heat meaning"
Re: overclocking
« Reply #3 on: May 10, 2018, 02:16:29 PM »
that is wrong. a 16-core processor could not do the vast majority of tasks any faster than 4 cores

Darvince

  • *****
  • Posts: 1842
  • 差不多
Re: overclocking
« Reply #4 on: May 10, 2018, 02:22:08 PM »
but what about a 1,378,913,065,775,496,824,682,182,051,857,728,448,902,028,277,271,278,088,224,317,349,054,049,721,856,053,955,032,165,000,485,952,146,958,446,223,387,833,982,704,161,766,047,792,183,079,895,777,875,237,766,653,530,662,154,044,294,980,748,355,504,146,827,894,396,365,898,183,024,673,030,144 core processor

vh

  • formerly mudkipz
  • *****
  • Posts: 1140
  • "giving heat meaning"
Re: overclocking
« Reply #5 on: May 10, 2018, 02:23:06 PM »
that is wrong. a shitpost could not do the vast majority of tasks any faster than shimao's euphoric intellect

Darvince

  • *****
  • Posts: 1842
  • 差不多
Re: overclocking
« Reply #6 on: May 10, 2018, 02:24:54 PM »
excuse you I am suber brian of 43 brain 6x7 plus krishna

vh

  • formerly mudkipz
  • *****
  • Posts: 1140
  • "giving heat meaning"
Re: overclocking
« Reply #7 on: May 10, 2018, 02:28:08 PM »
question: is your 10th grandfather's brain equivalent to your brain's 10th grandfather (grandbrain)?

if so, there's probably some analogy to be made about this in a category theory textbook containing words like "morphism" and "functor"