FMUSER Wirless Transmit Video And Audio More Easier !


    Samsung Exynos M3 architecture revealed: a rival to the Cortex A76

     

    "Earlier this year, Samsung announced the Exynos 9810, a flagship processor whose CPU uses the third generation of Samsung's in-house Mongoose architecture: the Exynos M3. Until now little was known publicly about the Exynos M3, but at yesterday's Hot Chips conference Samsung officially disclosed its detailed design, giving the outside world a first look at this very wide architecture.

When the Exynos 9810 was first announced, it was widely expected to deliver a huge leap in performance. Reported GeekBench single-threaded scores for the Exynos 9810 reached as high as 3400, far ahead of the Snapdragon 845 and the other processors in the Android camp.

Samsung's architecture development is carried out at its R&D center in Austin, Texas (SARC, Samsung's Austin R&D Center). The center was established in 2010 to develop Samsung's Exynos family of mobile processors. The Exynos M1, Samsung's first in-house architecture, reportedly entered design sometime in 2012, and after just three years of development it debuted in the Exynos 8890 in the 2016 Galaxy S7.

The Exynos M1 was designed entirely from scratch, and later generations naturally use it as their starting point. With the completion of the Exynos M1, the SARC team began work on the Exynos M3 in the first quarter of 2015, starting from the existing Exynos M1 RTL. The Exynos M3 was initially positioned as an incremental development, but in the first quarter of 2016 the plan changed substantially: the architecture's targets were raised to achieve much higher performance, and the originally planned incremental design was split off as the Exynos M2 (used in the Exynos 8895). The Exynos M2 improved IPC by up to 20% across workloads, so that even with its frequency reduced by 12% it still outperformed the Exynos M1.
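As a quick sanity check on the Exynos M2 claim, performance can be approximated as IPC multiplied by clock frequency. The sketch below applies that simplified model (an assumption; real workloads also depend on memory behavior) to the quoted 20% IPC gain and 12% frequency reduction. The function name `relative_performance` is illustrative only.

```python
def relative_performance(ipc_gain: float, freq_change: float) -> float:
    """Return performance relative to a baseline, modeled as IPC x frequency.

    Both arguments are fractional changes, e.g. 0.20 for +20%, -0.12 for -12%.
    """
    return (1.0 + ipc_gain) * (1.0 + freq_change)


# Figures quoted in the article for the Exynos M2 relative to the Exynos M1.
m2_vs_m1 = relative_performance(ipc_gain=0.20, freq_change=-0.12)
print(f"Exynos M2 vs M1: {m2_vs_m1:.3f}x")
```

Under this model the result is slightly above 1.0, which is consistent with the statement that the Exynos M2 stays ahead of the Exynos M1 despite the lower clock.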
Samsung reached this growth target in the Exynos M2 by bringing forward some features originally designed for the Exynos M3, while the new Exynos M3 design became far more aggressive.

Third-generation Mongoose architecture

Comparing the Exynos M3 with the Exynos M1 reveals many similarities, but the Exynos M3 is substantially wider. The SARC team widened the decode stage from 4-wide to 6-wide, added an integer ALU with multiply capability, added a load unit, and significantly strengthened the floating-point/SIMD units, tripling their compute capacity.

Samsung never really disclosed information about the Exynos M2, nor was there a dedicated compiler machine model for it, but at Hot Chips Samsung stated that the Exynos M2 enlarged the out-of-order window from 96 to 100 entries. Although different ISAs cannot be compared directly, the Exynos M3 extends the out-of-order window again to 228 entries, which is quite close to Intel's Core designs. It is worth noting that the Arm Cortex A76, disclosed in early June, has an out-of-order window of 128 entries, smaller than that of the Exynos M3. Arm stated that a 7% increase in window size yields only about 1% more performance, so to balance performance against area and power it did not focus on enlarging this part of the design.

Samsung explained at the conference that the out-of-order window size is tied to the rest of the architecture, such as the back-end scheduler and buffer capacities, and that window size and decode width should be designed to match each other. The Exynos M3's 6-wide decoder can keep the larger window well filled and so benefits from it.

A larger front end

Looking more closely at the front end, there are improvements throughout the branch prediction and fetch units.
Unlike other architectures, the Exynos M1's branch prediction unit can take two branches per cycle, with two branch ports in the back end. The Exynos M3 appears to retain this width, and it grows the μBTB (micro branch target buffer) from 64 to 128 entries. The main BTB stays at 4,000 entries, below the 6,000 entries of the Cortex A76, but taken-branch latency has been improved. More importantly, the overall quality of the Exynos M3's branch prediction has improved, with branch mispredictions reduced by 15%. Samsung presented real MPKI (misses per kilo instructions, that is, mispredictions per thousand instructions) values at the conference, data that Arm and other vendors have not published.

The Exynos M3's branch prediction and fetch units are decoupled by an address queue and an instruction queue, so that the units can be clock-gated during execution. Fetch bandwidth is also doubled: the fetch unit can read up to 48 bytes per clock cycle, that is, twelve 32-bit instructions. This gives a fetch-to-decode ratio of 2:1 (12 fetched / 6 decoded), up from 1.5:1 on the Exynos M2 (6 fetched / 4 decoded).

Samsung said the wider fetch was needed to deal with the growing problem of branch bubbles, and that it came at a substantial cost in resources. Because the average distance between taken branches is less than 12 instructions, a 2:1 fetch-to-decode ratio may look wasteful, but the extra width helps considerably with bursts of instruction decode. Although this design has high instantaneous power draw, when the instruction queue (now twice as deep) fills faster than the decoder drains it, the fetch unit can be clock-gated, which has a positive effect on power consumption.

The Exynos M3's instruction translation lookaside buffer (ITLB) grows from 256 to 512 entries.
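The fetch-to-decode ratios quoted above follow directly from the fetch bandwidth and the decode width. The following is a minimal sketch of that arithmetic; the function name is illustrative, and the 24 bytes per cycle used for the Exynos M2 is derived from the quoted 6 instructions per cycle rather than stated in the source.

```python
INSTRUCTION_BYTES = 4  # fixed-width 32-bit instructions, as stated in the article


def fetch_to_decode_ratio(fetch_bytes_per_cycle: int, decode_width: int):
    """Return (instructions fetched per cycle, fetch/decode ratio)."""
    fetched = fetch_bytes_per_cycle // INSTRUCTION_BYTES
    return fetched, fetched / decode_width


# Exynos M3: 48 bytes per cycle and a 6-wide decoder (from the article).
m3_fetched, m3_ratio = fetch_to_decode_ratio(48, 6)
# Exynos M2: 6 instructions per cycle, i.e. 24 bytes (derived), 4-wide decoder.
m2_fetched, m2_ratio = fetch_to_decode_ratio(24, 4)

print(f"M3: {m3_fetched} instructions/cycle, ratio {m3_ratio:.1f}:1")
print(f"M2: {m2_fetched} instructions/cycle, ratio {m2_ratio:.1f}:1")
```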
Samsung uses a three-level TLB hierarchy that differs from conventional Arm processors. The ITLBs of the Cortex A75 and Cortex A76 have 32 and 48 entries respectively, backed by a main TLB of 1,280 entries: 1,024 entries for pages up to 64KB and a secondary table of 256 entries for pages of 1MB and larger. The Exynos M3 also has separate data and instruction TLBs, but Samsung did not disclose the size of the L1 ITLB.

Pipeline and cache design

The Exynos M3's decode unit is 6-wide, but Samsung disclosed few details, saying only that instruction/micro-op fusion capabilities were improved. Since the Exynos M1, the design has supported a form of multi-dispatch: a decoded micro-op can be dispatched to multiple schedulers simultaneously, yet in the out-of-order window it still counts as only one dispatch and one entry.

In the pipeline, the integer side of the Exynos M3 adds two more ports, raising micro-op issue capability from 7 per cycle in the previous generation to 9. One of the two new ports is an ALU with multiply capability, which doubles MUL throughput and raises simple integer arithmetic throughput by 25%. The other new port is a load AGU, which doubles the core's load bandwidth.

Brute-force floating-point capability

Compared with the previous generation, the floating-point capability of the Exynos M3 can only be called brute force. Samsung added a third pipeline, and throughput for simple floating-point multiplies and adds tripled. Building on the FMAC + FADD units of the Exynos M1, the Exynos M3 has three 128-bit FMAC/FADD units, so peak throughput doubles from 3 operations per cycle (1 × FMAC (2) + 1 × FADD (1)) to 6 (3 × FMAC (2)). Such a sharp increase in execution throughput naturally requires a larger scheduler window and physical register file: the Exynos M3's scheduler window grows from 32 to 62 entries, and the FP PRF from 96 to 192 entries.
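The peak floating-point figures above can be reproduced by counting operations per pipe, using the article's own convention that a fused multiply-accumulate counts as 2 operations and an add as 1. This is only a sketch of that tally; the dictionary layout and function name are illustrative.

```python
# Operations counted per instruction, following the article's tally:
# a fused multiply-accumulate (FMAC) counts as 2, a floating-point add (FADD) as 1.
OPS_PER_INSTRUCTION = {"FMAC": 2, "FADD": 1}


def peak_fp_ops_per_cycle(pipes: dict) -> int:
    """Sum peak FP operations per cycle over a {pipe_type: count} mapping."""
    return sum(OPS_PER_INSTRUCTION[kind] * count for kind, count in pipes.items())


m1_peak = peak_fp_ops_per_cycle({"FMAC": 1, "FADD": 1})  # 1*2 + 1*1
m3_peak = peak_fp_ops_per_cycle({"FMAC": 3})             # 3*2

print(f"Exynos M1 peak: {m1_peak} ops/cycle")
print(f"Exynos M3 peak: {m3_peak} ops/cycle ({m3_peak / m1_peak:.1f}x)")
```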
Samsung has also been working to reduce execution latency, and this applies to the floating-point pipelines as well. On the Exynos M3, multiply latency drops from 4 cycles to 3, multiply-accumulate latency from 5 cycles to 4, and simple floating-point add latency from 3 cycles to 2. The floating-point divider is upgraded to Radix-64, which significantly reduces division latency.

Incidentally, Arm made much of the Cortex A76's new floating-point pipeline at launch, proudly calling it the "VX (vector execution) pipeline". With the Exynos M3, however, Samsung appears to be ahead of Arm once again within the year: the Exynos M3 and Cortex A76 have the same floating-point latencies, but the Exynos M3 also offers higher execution throughput and lower ASIMD latency.

The new load/store unit

The Exynos M3's load/store unit adds a second 128-bit load port, doubling read bandwidth. Load latency remains at 4 cycles, and store bandwidth remains at one store per cycle with one cycle of latency. Both load units of the Exynos M3 handle 128 bits per cycle, the same as the Cortex A76, whereas the Cortex A75 handles 64 bits per cycle.

Overall, the Exynos M3's load/store scheduler capacity has increased and the store buffer has doubled. Samsung's prefetchers need to perform very well here to approach ideal cache hit rates and avoid memory bottlenecks.

Samsung then turned to the new TLB hierarchy mentioned earlier. The Exynos M3 has the same 32-entry DTLB (data translation lookaside buffer) as the Exynos M1, but adds a new 512-entry mid-level DTLB, and the unified L2 TLB that serves both the ITLB (instruction translation lookaside buffer) and the DTLB grows from 1,024 entries in the previous generation to 4,096 entries.
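To give a feel for what these TLB sizes mean, the sketch below computes the "reach" of each level, that is, the amount of memory covered when every entry maps one page. The 4 KiB page size is an assumption made for illustration (the source does not state which page sizes each level holds), and the function name is hypothetical.

```python
PAGE_SIZE_BYTES = 4 * 1024  # assumption: 4 KiB pages; larger pages extend the reach


def tlb_reach_bytes(entries: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Memory covered by a TLB when every entry maps exactly one page."""
    return entries * page_size


# Entry counts quoted in the article for the Exynos M3 data side.
levels = {"L1 DTLB": 32, "mid-level DTLB": 512, "unified L2 TLB": 4096}
for name, entries in levels.items():
    reach_kib = tlb_reach_bytes(entries) // 1024
    print(f"{name}: {entries} entries -> {reach_kib} KiB")
```

Under the same assumption, the previous generation's 1,024-entry L2 TLB would cover 4 MiB, so the fourfold growth in entries translates directly into fourfold reach.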
The cost of a high-performance pipeline

Obviously, scaling up the architecture has a price. Compared with the 15-stage pipeline of the Exynos M1, the Exynos M3's pipeline is 2 stages deeper at 17 stages, adding a second dispatch stage and a second register-read stage. By contrast, the pipeline depth of the Cortex A75 and Cortex A76 is 13. The branch misprediction penalty rises from 14 cycles on the Exynos M1 to 16 cycles on the Exynos M3. The disadvantage of the Exynos M1 and Exynos M3 relative to Arm's stock designs is that their fetch and decode take 2 more stages, register renaming takes 1 more stage, and a second dispatch stage is needed (1 more stage). Samsung did not disclose whether the Exynos M3 has fast paths between pipeline stages to reduce latency in critical cases, but said this is a necessary cost of designing a large, high-performance architecture.

In theory, a longer pipeline favors higher frequency (Intel's Pentium 4, chasing high clocks, used a 31-stage pipeline in the Prescott core of its NetBurst architecture). Strangely, though, Samsung's 17-stage pipeline does not give the Exynos M3 a frequency advantage, and the core instead pays a heavy power penalty at high frequency. This seems to imply that Samsung's EDA skills still need to improve.

The new cache hierarchy

Like the Cortex A75 and Cortex A76, the Exynos M3 introduces a new private L2 cache as an intermediate level between the core and the shared last-level cache. The new private L2 is 512KB per core, and compared with the shared L2 of the Exynos M1, access latency falls from 22 cycles to 12 cycles. However, the Cortex A75's L2 hit latency is only 8 cycles, so the Exynos M3 is at a disadvantage here. The Exynos M3's L2 bandwidth is doubled to 32 bytes read per clock cycle, whereas the Cortex A75 reads 16 bytes and writes 32 bytes per clock cycle.
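Cycle counts like the L2 figures above only become comparable across chips once they are converted to time. The sketch below does that conversion; the 2.7GHz and 2.8GHz clocks are the peak big-core frequencies the article quotes for the Exynos 9810 and Snapdragon 845, and applying them to the design latencies is an assumption made for illustration.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz


M3_CLOCK_GHZ = 2.7   # Exynos 9810 big-core clock quoted in the article
A75_CLOCK_GHZ = 2.8  # Snapdragon 845 big-core clock quoted in the article

m3_l2_ns = cycles_to_ns(12, M3_CLOCK_GHZ)          # private L2 of the Exynos M3
shared_l2_ns = cycles_to_ns(22, M3_CLOCK_GHZ)      # M1-style shared L2, same clock for comparison
a75_l2_ns = cycles_to_ns(8, A75_CLOCK_GHZ)         # Cortex A75 L2 hit latency

print(f"Exynos M3 private L2:   {m3_l2_ns:.2f} ns")
print(f"22-cycle shared L2:     {shared_l2_ns:.2f} ns")
print(f"Cortex A75 L2 (design): {a75_l2_ns:.2f} ns")
```

These are design-level figures only; measured latencies on real silicon can be higher.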
Note that on a real chip the physical layout matters, so actual latencies may be higher. Measurements show that the Snapdragon 845's L2 latency is about 4.4ns at 2.8GHz, while the Exynos 9810's L2 latency at 2.7GHz is about 4.6ns.

The Exynos M3's shared L3 is a large 4MB cache built as a NUCA (non-uniform cache architecture), made up of four 1MB slices, each placed opposite a CPU core. Because of this non-uniform arrangement, a core's access latency differs from slice to slice: 32 cycles to an adjacent slice and 44 cycles to the farthest slice, with Samsung quoting an overall average access latency of 37 cycles. As with the L2, the Exynos M3's L3 should also be weaker than Arm's stock design, since the Cortex A75's L3 access latency is only 25 cycles. This is reflected in measurements too: the Snapdragon 845's L3 access latency is about 11.4ns, while the Exynos 9810's is between 11ns and 20ns.

Samsung explained at the conference that this sliced cache design gives better configurability for designs other than high-end mobile devices, which seems to hint that S.LSI is trying to enter related areas such as automotive. Overall, Samsung admitted that the cache hierarchy of the final product did not reach the level it really wanted.

Physical layout on silicon

This year Samsung also published data on the chip's physical layout, with a detailed floor plan of the processor core. Here is a brief explanation of the terms used in the picture:
· PL2: private L2 cache, 512KB per core on the Exynos M3.
· FPB: floating-point data path; the floating-point/ASIMD units.
· FRS: floating-point scheduler and floating-point/vector register file.
· MC: decode and rename units.
· DFX: debug/test logic; stands for "design for X", such as DFD (design for debug), DFT (design for test), DFM (design for manufacturability) and other miscellaneous logic.
· LS: load/store unit and the 64KB L1 data cache.
· IXU: integer execution unit, including execution units, schedulers and register files.
· TBW: transparent buffering, including the TLB structures.
· FE: front end, including the branch prediction unit, the fetch unit and the 64KB L1 instruction cache.

Compared with the Exynos M1, the functional units of the Exynos M3 have grown. The final Exynos M3 core measures 2.52mm², plus another 0.98mm² for the 512KB L2 cache.

[Figure: Exynos M1 core layout]

Samsung also showed a floor plan of the entire Exynos M3 cluster: the 4 cores are arranged next to one another, as are the L2 and L3 slices. This greatly simplifies chip layout, since each module is simply replicated 4 times.

IPC up 59% on average

At the end of the session, Samsung analyzed the performance of the Exynos M3 and claimed an IPC gain of approximately 59%. The gain is of course not uniform across workloads. On high-ILP (instruction-level parallelism) workloads the IPC gain is limited to about 25%, and on MLP (memory-level parallelism) workloads there is almost no gain, while many other mixed workloads gain more than 80%.

Samsung showed the performance of the Exynos M2, Exynos M3 and Cortex A75, representing the Exynos 8895, Exynos 9810 and Snapdragon 845 respectively. In addition to Samsung's comparison, I have added some new SPEC scores that improve on the data from the initial review; the new scores account for dynamic voltage and frequency scaling and use a more complete test environment.
Energy efficiency has always been an important metric for a processor, but Samsung clearly glossed over it for the Exynos M3. As mentioned above, the Exynos M3 draws a great deal of power at high frequency: although it holds an absolute performance lead at 2.7GHz, its energy efficiency at that point is lower than that of the Exynos M2. Once the frequency is reduced to 2.3GHz, the same as the Exynos M2, the Exynos M3 shows an efficiency advantage.

The figure below shows energy consumption and average power in the tests. The bars on the left represent energy consumed: the shorter the bar, the more efficient the platform. The bars on the right represent the performance score: the longer the bar, the higher the performance.

From the results, the Exynos M3 covers a fairly wide range of energy efficiency. Compared with the Cortex A75 at 2.8GHz, the Exynos M3 can deliver higher performance at similar energy efficiency, and compared with the Exynos M2, the Exynos M3 comfortably outperforms the Exynos M2 running at its full 2.3GHz.

The Exynos M3 cluster in the Exynos 9810 was tested at 2.7GHz, 2.3GHz and 1.8GHz, with power consumption reaching around 3.5 watts. In other words, M3 core power doubles from 1.8GHz to 2.3GHz, an increase of 500MHz, and doubles again from 2.3GHz to 2.7GHz, an increase of only 400MHz. That is, going from 1.8GHz to 2.7GHz, even if performance scaled linearly with frequency, the gain would be only 50%, while power consumption has quadrupled.

The four Exynos M3 cores of the Exynos 9810 all run at the same voltage and frequency.
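The frequency and power scaling described above can be turned into a rough performance-per-watt comparison. The sketch below uses only relative power (1x, 2x, 4x, following the two doublings in the text) and assumes, as the article does for the sake of argument, that performance scales linearly with frequency. Absolute wattages are deliberately not used, and the variable names are illustrative.

```python
# Operating points: (frequency in GHz, power relative to the 1.8 GHz point).
# Power doubles at each step, per the article.
points = [(1.8, 1.0), (2.3, 2.0), (2.7, 4.0)]
base_freq = points[0][0]

results = []
for freq, rel_power in points:
    rel_perf = freq / base_freq            # assumption: performance is linear in frequency
    rel_efficiency = rel_perf / rel_power  # performance per unit of power
    results.append((freq, rel_perf, rel_power, rel_efficiency))
    print(f"{freq:.1f} GHz: perf {rel_perf:.2f}x, power {rel_power:.0f}x, "
          f"perf/W {rel_efficiency:.2f}x")
```

Even with this optimistic performance assumption, efficiency at the top operating point drops to well under half of what it is at 1.8GHz.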
In many cases the cores running secondary threads do not need to run at the same top performance point as the core running the main thread. But when their needs exceed what the Cortex A55 little cores can deliver, those cores can only run at the same high frequency as the main-thread core, which greatly reduces overall energy efficiency.

Samsung's future strategy and conclusion

Finally, Samsung discussed the schedule of the Exynos project in more detail. Planning for the Exynos M3 started in the second quarter of 2014, and RTL work began in the first quarter of 2015 with the completion of the M1, but the development plan later changed and the targets were set higher. The original Exynos M3 was split off as the Exynos M2, and the new Exynos M3 design became extremely aggressive. The RTL was delivered to the SoC team in the first quarter of 2017 for the first EVT0 tape-out of the Exynos 9810 (shipping chips are EVT1), which was completed in mid-2017, and the Exynos 9810 finally reached the market in March 2018.

The Exynos M3 was a considerable challenge for Samsung's design team: almost the entire architecture was overhauled, under extreme time pressure to ship before the project deadline. Overall, the Exynos M3 is a very solid microarchitecture that feels more like a desktop design. Samsung took the simplest, most brute-force approach of scaling everything up, which also leads to its huge core size. However, because of time constraints, Samsung appears to have held back many improvements that did not make it into the Exynos M3. The cache hierarchy in particular seems to be the weakest part of this architecture, and Samsung admits it is not satisfied with it.

Samsung's Mongoose architecture has now gone through two generational updates, the Exynos M2 and the Exynos M3, with IPC gains as high as 20% and 59% respectively.
Reportedly, the SARC design team now delivers an architecture update every year, and Samsung suggests that the real competitor to the Arm Cortex A76 is next year's Exynos M4, not the current Exynos M3. Just a few days ago, Arm announced its future CPU roadmap, revealing the Cortex A76's successors Deimos and Hercules and promising generational gains of roughly 15% and 10%. Judging from current test data, the Exynos M3 already seems to match or exceed the Cortex A76 in performance, and we may well see the competitive advantage of Samsung's in-house architecture continue with the Exynos M4.

Thanks are due to Samsung for disclosing its architecture in this level of detail, which is rare in the industry. I hope that S.LSI and SARC can resolve the weaknesses of the Exynos M3 so that next year's new architecture is an even greater success.

Original address: https://www.eeboard.com/news/cortexa76exynosm3/ "

     

     

     

     
