The Edison's main SoC is a 22 nm Intel Atom “Tangier” (Z34XX) that includes two Atom Silvermont (SLM) cores. Although Intel never advertised it, the CPU is known to be 64-bit (x86_64) capable.
1) Feel free to add Edison / NUC E3815 or other Baytrail examples here.
There are quite a number of disadvantages:
From the Intel® 64 and IA-32 Architectures Optimization Reference Manual:
- The total length of the instruction bytes that can be decoded each cycle varies by microarchitecture.
SLM: up to 16 bytes per cycle with instruction not more than 8 bytes in length. For an instruction length exceeding 8 bytes, only one instruction per cycle is decoded on decoder 0.
- An instruction with multiple prefixes can restrict decode throughput. The restriction is on the length of bytes combining prefixes and escape bytes. There is a 3 cycle penalty when the escape/prefix count exceeds the following limits as specified per microarchitectures.
SLM: the limit is 3 bytes.
- Only decoder 0 can decode an instruction exceeding the limit of prefix/escape byte restriction on the Silvermont and Goldmont microarchitectures.
- The maximum number of branches that can be decoded each cycle is 1 for SLM. Prevent a re-steer penalty by avoiding back-to-back conditional branches.
Unfortunately, x86_64 mode adds a prefix byte (REX) to instructions that are already long. For instance, CRC32Q exceeds the limit, causing a 3 cycle penalty that wipes out the performance gain the instruction would otherwise provide.
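To make the byte counting concrete, here is a minimal sketch in C that counts the leading prefix and escape bytes of a single encoded instruction and compares the count with the SLM limit of 3. The function names `count_prefix_escape` and `exceeds_slm_limit` are hypothetical, the decoder is deliberately simplified (no VEX/EVEX handling), and it assumes that the REX byte counts toward the limit, as the text above implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SLM limit on combined prefix + escape bytes, as quoted from the manual. */
enum { SLM_PREFIX_ESCAPE_LIMIT = 3 };

/* Legacy prefixes: lock/rep, segment overrides, operand/address size. */
int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:
    case 0x2E: case 0x36: case 0x3E: case 0x26: case 0x64: case 0x65:
    case 0x66: case 0x67:
        return 1;
    default:
        return 0;
    }
}

/* Count the prefix and escape bytes at the start of one instruction.
 * long_mode != 0 treats 0x40..0x4F as a REX prefix (64-bit mode only).
 * Assumption: REX counts toward the limit. VEX/EVEX are not handled. */
size_t count_prefix_escape(const uint8_t *enc, size_t len, int long_mode)
{
    size_t i = 0;

    assert(enc != NULL);
    while (i < len && is_legacy_prefix(enc[i]))
        i++;
    if (long_mode && i < len && (enc[i] & 0xF0) == 0x40)
        i++;                                   /* REX prefix */
    if (i < len && enc[i] == 0x0F) {
        i++;                                   /* first escape byte */
        if (i < len && (enc[i] == 0x38 || enc[i] == 0x3A))
            i++;                               /* second escape byte */
    }
    return i;
}

/* Non-zero when the instruction would hit the decode penalty on SLM. */
int exceeds_slm_limit(const uint8_t *enc, size_t len, int long_mode)
{
    return count_prefix_escape(enc, len, long_mode) > SLM_PREFIX_ESCAPE_LIMIT;
}
```

Feeding it the 32-bit form of CRC32 (`F2 0F 38 F1 /r`) gives 3 bytes, which is within the limit. The 64-bit CRC32Q form (`F2 48 0F 38 F1 /r`) gives 4 bytes, which exceeds it.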
Fortunately, there is a way around this restriction. Again from the Intel manual, engaging the Loop Stream Detector (LSD):
The Silvermont and Goldmont microarchitectures include a Loop Stream Detector (LSD) that provides the back end with uops that are already decoded. This provides performance and power benefits. When the LSD is engaged, front end decode restrictions, such as number of prefix/escape bytes and instruction length, no longer apply.
It appears the LSD can kick in for short loops once a certain number of iterations have occurred (this is not clearly documented, but the number is probably 64). To use this, take care that the compiler does not unroll your loop. The effect can be quite dramatic: because the 3 cycle penalty is eliminated after 64 iterations, a 3x speed-up can be observed for long-running loops.
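As an illustration of such a loop, the following sketch computes a CRC-32C checksum with the CRC32Q instruction (through the `_mm_crc32_u64` intrinsic) and asks the compiler to keep the hot loop rolled. The function name `crc32c` is hypothetical, the pragmas assume GCC 8 or later or Clang, and whether the loop actually engages the LSD on a given build has not been verified here. A plain bitwise fallback is included only so the example compiles on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>  /* _mm_crc32_u64, _mm_crc32_u8 (SSE4.2) */

/* CRC-32C over a buffer, 8 bytes per iteration using CRC32Q. */
__attribute__((target("sse4.2")))
uint32_t crc32c(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);

    /* Keep the loop small and rolled so it stays a candidate for the LSD. */
#if defined(__clang__)
#pragma clang loop unroll(disable)
#elif __GNUC__ >= 8
#pragma GCC unroll 1
#endif
    for (; len >= 8; p += 8, len -= 8) {
        uint64_t word;
        memcpy(&word, p, sizeof word);     /* unaligned-safe load */
        crc = _mm_crc32_u64(crc, word);    /* emits CRC32Q */
    }
    /* Remaining 0..7 bytes, one at a time. */
    for (; len > 0; p++, len--)
        crc = _mm_crc32_u8((uint32_t)crc, *p);

    return (uint32_t)crc ^ 0xFFFFFFFFu;
}
#else
/* Portable bitwise fallback (reflected polynomial 0x82F63B78). */
uint32_t crc32c(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    assert(buf != NULL || len == 0);
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}
#endif
```

Inspecting the generated assembly (for example with `objdump -d`) is a reasonable way to confirm that the loop body was not unrolled and that a 64-bit `crc32` instruction is actually emitted.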
KBUILD_DEFCONFIG="x86_64_defconfig" and set DEFAULTTUNE = "core2-64".
Alternatively, you can check out kirkstone, which will build an x86_64 ACPI-enabled version.
© 2018 Ferry Toth