Monday, October 5, 2020

Just In Time Compilation - JIT

Introduction
The abbreviation JIT stands for "Just In Time Compilation". Unlike classic C++-style compilers, the javac command does not try to do much optimization at compile time; it only produces byte-code. That work is left to the JIT, which kicks in at run time. In other words, javac only produces Tier 0 byte-code.
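This can be observed with javap: the byte-code javac emits mirrors the source almost one-to-one, with no visible optimization (Hello.java is a hypothetical file used only for illustration):

```shell
# javac performs almost no optimization; it only emits byte-code
javac Hello.java
# disassemble the class to inspect the unoptimized byte-code
javap -c Hello
```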

Ahead-of-Time Compilation (AOT)
Because JIT is slow, the Ahead-of-Time Compilation (AOT) option was introduced starting with Java 9.

What Does the Word Byte-code Mean
You can look at the Byte-code post.

How Does JIT Work
For JIT, HotSpot has two kinds of compilers.

These compilers are named C1 and C2. I moved the details to the JIT C1 ve C2 Derleyici post.

C1 can also be seen in GraalVM. I am not sure about C2. The actual explanation for GraalVM is as follows:
Up until now, GraalVM has offered two ways to run Java programs: using the Java HotSpot VM with the GraalVM JIT (just-in-time) compiler, and compiled to a native executable using GraalVM Native Image

Today, we’re happy to announce a new way to run Java on GraalVM. GraalVM 21.0 introduces a new installable component, named espresso, which provides a JVM implementation written in Java.
So GraalVM has started to offer alternatives to the Java HotSpot VM it bundles. The espresso component is a JVM written using the Truffle Language Implementation Framework.

What Are the Downsides of JIT Kicking In Late
The explanation is as follows. I think the most important point here is that if the JVM shuts down, all of the JIT work has to be redone from scratch:
The actual compiled code is therefore very fast. But there are three downsides:

1. A method needs to be called a certain number of times to reach the compilation threshold before it can be optimised and compiled (the limit is configurable but typically around 10,000 calls). Until then, unoptimised code is not running at “full speed”. There is a compromise between getting quicker compilation and getting high-quality compilation (if the assumptions were wrong there will be a cost of recompilation).

2. When the Java application restarts, we are back to square one and must wait to reach that threshold again.

3. Some applications (like ours) have some infrequent but critical methods that will only be invoked a handful of times but need to be extremely fast when they do (think of a risk or stop-loss process only called in emergencies).
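The threshold behaviour described above can be sketched with a tiny program (the class and method names here are hypothetical). Running it with -XX:+PrintCompilation shows the method being compiled once its invocation count crosses the threshold:

```java
public class WarmupDemo {
    // A hypothetical hot method: once it has been invoked enough times
    // (typically around 10,000, see -XX:CompileThreshold), the JIT
    // compiles it to native code.
    static long sum(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += i;
        return s;
    }

    public static void main(String[] args) {
        // Cross the default compilation threshold.
        for (int i = 0; i < 20_000; i++) sum(100);
        // The result is identical before and after compilation;
        // only the execution speed changes.
        System.out.println(sum(10)); // prints 45
    }
}
```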

How Are JIT Performance Metrics Collected for C1 and C2?
The compiler inserts profiling probes into the byte-code. These probes essentially work like counters. When a counter exceeds a threshold, the code is considered "hot" and the byte-code is compiled to machine code.

Some optimizations are also applied while compiling to machine code. Which optimization is applied depends on which tier the counter value corresponds to. Tier 1, Tier 2, Tier 3 and Tier 4 optimizations differ: Tier 0 is the interpreter, Tiers 1-3 are C1 compilations with increasing amounts of profiling, and Tier 4 is C2.

If we want to change the counter's default threshold, we can use the -XX:CompileThreshold option. The explanation is as follows:
Initially, the compilers insert profiling probes into the bytecode to determine which code paths are the hottest (e.g. by invocation count), invariants (which types are actually used), and branch prediction. Once enough analytics are collected, the compilers actually start to compile bytecode to native code once they are “hot enough” (-XX:CompileThreshold), replacing the existing byte code step by step (mixed mode execution).
The explanation for the counter is as follows:
...it makes sense to identify the code, that is run more commonly, and compile them ahead of time, and cache it.  

That is exactly, what later versions of JVMs started doing. A performance counter was introduced, that counted the number of times a particular method/snippets of code is executed.  Once a method/code snippet is used to a particular number of times (threshold), then that particular code snippet, is compiled, optimized & cached, by the “C1 compiler”

Next time, that code snippet is called, it directly executes the compiled machine instructions from the cache, rather than going through the interpreter. This brought in the first level of optimization.

While the code is getting executed, the JVM will perform runtime code profiling, and come up with code paths and hotspots. It then runs the “C2 compiler”, to further optimize the hot code paths…and hence the name “Hotspot”

C1 is faster, and good for short-running applications, while C2 is slower and heavy, but is ideal for long-running processes like daemons, servers, etc., where the code performs better over time.

In Java 6, we have an option to use either C1 or C2 methods (with a command-line argument -client (for C1), -server (for C2)), in Java 7, we could use both, and from Java 8 onwards it became the default behavior.
Another explanation is as follows:
The profiler maintains a counter which counts the number of calls to a particular method. When it passes some threshold value predefined in the JVM (you can find this value by using the CompileThreshold flag), the JIT compiler compiles that particular method into native code so that the interpreter can use that native code next time. The JIT compiler does not stop its job after compiling the code into machine code. It goes even further. If that particular code block passes the next threshold value, the JIT compiler optimizes the machine code again. This happens in 4 stages. For that, there are two compilers in the JIT compiler: the C1 compiler and the C2 compiler. We also call the C1 compiler the client compiler and the C2 compiler the server compiler. It is not about traditional client-server architecture but about the time a Java application runs.
What Is the Code Cache?
The explanation is as follows. In other words, it is an area outside the JVM heap that holds the code compiled by the C1 and C2 compilers:
The memory area that the JIT compiler uses for this code compilation is called "code cache." This area resides outside of the JVM heap and metaspace. To learn about different JVM memory regions, you may refer to this video clip.
Another explanation is as follows:
There is a memory area called code cache. It is like the cache memory in our computer system. It has limited memory, and we can tune code cache with JVM flags. When some code block is called even more, then the JIT compiler moves that method into the code cache so that the interpreter can access that particular code block quickly. But the JIT compiler does not add all the codes into this code cache since it makes a tradeoff.
The code cache size can be configured. The explanation is as follows:
One of the interesting things is to find the size of the code cache. This can be done by using -XX:+PrintCodeCache. The maximum size of the code cache depends on which version of Java you are using. We can change the code cache size with these flags:

-XX:InitialCodeCacheSize=n: initial size
-XX:ReservedCodeCacheSize=n: max size
-XX:CodeCacheExpansionSize=n: how quickly the cache can grow

the size can be in bytes, kilobytes (by adding suffix k), megabytes (by adding suffix m),…

There is also an application support for monitoring the code cache size. By using JConsole application you will get quite nice graph representation.
Another explanation is as follows:
-XX:ReservedCodeCacheSize=N
The code that the Hotspot JIT compiler compiles/optimizes is stored in the code cache area of the JVM memory. The default size of this code cache area is 240MB. You can increase it by passing -XX:ReservedCodeCacheSize=N to your application. Let's say you want to make it 512 MB. You can specify it like this: -XX:ReservedCodeCacheSize=512m. Increasing the code cache size has the potential to reduce the CPU consumption of the compiler threads.
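Putting the flags from the quotes above together, a hypothetical invocation might look like this (app.jar is a placeholder for your own application):

```shell
# Print code cache statistics at JVM exit and raise the reserved
# code cache size from the default 240 MB to 512 MB
java -XX:+PrintCodeCache -XX:ReservedCodeCacheSize=512m -jar app.jar
```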

JVM Compiler Interface - JVMCI
It comes with Java 9. Using this interface, it is possible to plug in compilers other than C1 and C2. The explanation is as follows:
In Java 9, JVMCI (JVM Compiler Interface) was introduced. This allowed writing compilers, as plugins, that JVM can call for dynamic compilation. It provides an API and a protocol to build compilers with custom implementations and optimisations.
...
The word compiler, I am referring to here, is not the javac type of compiler, this is the compiler within the JVM, (like C1, C2), that converts the bytecode to optimized machine code (instead of interpreter, as explained in the previous section)
What Kinds of Optimizations Do the Compilers Perform?
The explanation of JIT Code Optimizations is as follows:
Here are some of the code optimizations that the JVM compiler performs:

Removing null checks (for variables that are never null)
Inlining smaller, most called methods (small methods) reducing the method calls
Optimizing the loops, by combining, unrolling & inversions
Removing the code that is never called (Dead code)
and many more…

Whatever said and done, JIT (Just-In-time compilation) is slow, as there is a lot of work that the JVM has to do in the runtime.
Constant Folding Optimization
The explanation is as follows:
Starting with the hot path, one of the first things the compiler tries to achieve is constant folding. Using partial evaluation and escape analysis, the compiler will try to determine if certain constructs can be reduced to constants (e.g. the expression 3 * 5 can be replaced with 15). 
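As a minimal sketch (the class is hypothetical), an expression built from constants can be reduced to a single value. Note that javac can already fold literal constant expressions; the JIT applies the same idea, via partial evaluation, to values it proves constant at run time:

```java
public class FoldDemo {
    static final int A = 3;
    static final int B = 5;

    // Because A and B are compile-time constants, the expression A * B
    // can be folded to 15; no multiplication happens at run time.
    static int area() {
        return A * B;
    }

    public static void main(String[] args) {
        System.out.println(area()); // prints 15
    }
}
```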
Inlining Optimization
The explanation is as follows:
Another rather simple optimization is to avoid method calls by inlining methods into their call sites (if they are small enough).
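A hypothetical sketch of an inlining candidate: the accessor below is small enough that the JIT can replace each call site with the method body itself, eliminating the call overhead:

```java
public class InlineDemo {
    private int value = 42;

    // A tiny accessor like this is a classic inlining candidate.
    int getValue() { return value; }

    int doubled() {
        // After inlining there are no calls here, just two field reads.
        return getValue() + getValue();
    }

    public static void main(String[] args) {
        System.out.println(new InlineDemo().doubled()); // prints 84
    }
}
```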
Intrinsics Optimization
I moved the details to the Intrinsics Optimizasyonu post.

Lock Coarsening and Lock Elision Optimization
The explanation is as follows:
Multi-threaded applications may as well benefit from the optimizations the JIT can do with synchronization locks. Depending on the locks used, the compiler may merge synchronized blocks together (Lock Coarsening) or even remove them completely if escape analysis determines that nobody else can lock on those objects (Lock Elision).
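A hypothetical sketch of a lock that can be elided: the lock object never leaves the method, so escape analysis can prove no other thread could ever synchronize on it, and the synchronized block can be removed entirely:

```java
public class LockDemo {
    // The lock object is local and never escapes this method, so
    // escape analysis can prove the synchronization is unnecessary
    // and elide the lock (Lock Elision).
    static int localLock() {
        Object lock = new Object();
        int c = 0;
        synchronized (lock) {
            c++;
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(localLock()); // prints 1
    }
}
```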
Disabling JIT
There are a few ways to do it. I do not know which one is best.
Example
We do it like this:
-Djava.compiler=NONE
Example
We do it like this:
-XX:-TieredCompilation
The explanation is as follows:
Pass this -XX:-TieredCompilation JVM argument to your application. This argument will disable the JIT HotSpot compilation. Thus CPU consumption will go down. However, as a side-effect, your application’s performance can degrade.
Example
-Xint means interpreted-only mode. The explanation is as follows:
-Xint

Operate in interpreted-only mode. Compilation to native code is disabled, and all bytecodes are executed by the interpreter. The performance benefits offered by the Java HotSpot Client VM's adaptive compiler will not be present in this mode.
We do it like this:
java -Xint CounterJitTest
Example
To exclude a specific class's method from JIT compilation, we do it like this:
-XX:CompileCommand=exclude,src/main/BasicHolder.getVALUE
Example
To exclude all classes in a package from JIT compilation, we do it like this:
-XX:CompileCommand=exclude,com.sun.xml.internal.messaging.saaj.client.p2p.*::*
Disabling Compiler Tiers
-XX:TieredStopAtLevel=N is used. Values 0, 1, 2, 3 or 4 can be passed for N. The explanation is as follows:
If a CPU spike is caused because of C2 compiler threads alone, you can turn off C2 compilation alone. You can pass -XX:TieredStopAtLevel=3. When you pass this -XX:TieredStopAtLevel argument with value 3, then only C1 compilation will be enabled and C2 compilation will be disabled. 
Setting the Thread Count for the C2 Compiler
-XX:CICompilerCount=N is used. The explanation is as follows:
You can consider increasing the C2 compiler threads by using the argument -XX:CICompilerCount. You can capture the thread dump and upload it to tools like fastThread. There you can see the number of C2 compiler threads. If you see a fewer number of C2 compiler threads and you have more CPU processors/cores, you can increase the C2 compiler thread count by specifying the -XX:CICompilerCount=8 argument.
Another explanation is as follows:
The default number of C1 and C2 compiler threads are determined based on the number of CPUs that are available on the container/device in which your application is running. Here is the table which summarizes the default number of C1 and C2 compiler threads:
CPUs    C1 Threads    C2 Threads
1       1             1
2       1             1
4       1             2
8       1             2
16      2             6
32      3             7
64      4             8
128     4             10


You can change the compiler thread count by passing the -XX:CICompilerCount=N JVM argument to your application. One-third of the count you specify in -XX:CICompilerCount will be allocated to the C1 compiler threads. The remaining thread count will be allocated to C2 compiler threads. Suppose you are going to use 6 threads (i.e., -XX:CICompilerCount=6), then 2 threads will be allocated to C1 compiler threads and 4 threads will be allocated to C2 compiler threads.


JIT and Events
There are two events for HotSpot:
- Uncommon trap and
- deoptimization
These events can be seen in crash files such as hs_err_pid.log. We would see output like this:
Events (10 events):
Event: 2603309.010 Thread 0x00007ff2c800c000 DEOPT UNPACKING pc=0x00007ff34aaddf69 sp=0x00007ff3409e88a8 mode 2
Event: 2603310.108 Thread 0x00007ff362229000 DEOPT PACKING pc=0x00007ff34b25ce6c sp=0x00007ff340ceb660
Event: 2603310.122 Thread 0x00007ff2c8009800 Uncommon trap: trap_request=0xffffff65 fr.pc=0x00007ff34b890e40
Event: 2603310.124 Thread 0x00007ff2c8009800 DEOPT PACKING pc=0x00007ff34b890e40 sp=0x00007ff3408e7790
Event: 2603310.124 Thread 0x00007ff2c8009800 DEOPT UNPACKING pc=0x00007ff34aaddf69 sp=0x00007ff3408e7680 mode 2
Event: 2603310.125 Thread 0x00007ff2c8009800 Uncommon trap: trap_request=0xffffff65 fr.pc=0x00007ff34b850fe4
Event: 2603310.125 Thread 0x00007ff2c8009800 DEOPT PACKING pc=0x00007ff34b850fe4 sp=0x00007ff3408e7560
Event: 2603310.125 Thread 0x00007ff2c8009800 DEOPT UNPACKING pc=0x00007ff34aaddf69 sp=0x00007ff3408e72d8 mode 2
Event: 2603310.126 Thread 0x00007ff362229000 DEOPT UNPACKING pc=0x00007ff34aaddf69 sp=0x00007ff340ceb628 mode 2
Event: 2603310.935 Thread 0x00007ff2d8001000 Thread added: 0x00007ff2d8001000
What Is the Uncommon trap Event
The explanation is as follows:
When code generated by C2 reverts back to the interpreter for further execution. C2 typically compiles for the common case, allowing it to focus on optimization of frequently executed paths. For example, C2 inserts an uncommon trap in generated code when a class that is uninitialized at compile time requires run time initialization.
In practice this means the following.
Example
Suppose we have the following code. Because the first if branch is executed very often, it is JIT-compiled, but the else branch is not; a trap is placed there instead. If one day the else branch runs, the trap is caught and the byte-code is interpreted again.
// start, delay and the TimeUtils helper are defined elsewhere in the original example
long now = start;
while (true) {
  if (now < start + TimeUtils.NANOS_IN_SECOND * delay)
  {
    now = TimeUtils.now();
  } else {
    // Will be printed after 30 sec
    if (TimeUtils.now() > start + TimeUtils.NANOS_IN_SECOND * (delay + 30)) {
      final long finalNow = now;
      System.out.println("Time is over at " +
                        TimeUtils.toInstant(finalNow) + " now: " +
                        TimeUtils.toInstant(TimeUtils.now()));
      System.exit(0);
    }
  }
}
The explanation is as follows:
That was an uncommon trap due to unstable_if at bytecode index 161. In other words, when main was JIT compiled, HotSpot did not produce code for the else branch, because it was never executed before (such a speculative dead code elimination). However, to retain correctness of the compiled code, HotSpot places a trap to deoptimize and fall back to the interpreter, if the speculative condition fails. This is exactly what happens in your case when if condition becomes false.
What Is the deoptimization Event?
Deoptimization means the compiler realizes it made a wrong assumption while translating the code to machine code and goes back to the byte-code. The explanation is as follows:
The compiler optimizes aggressively using heuristics as well. In case a guess was actually wrong (e.g. the seemingly unused branch was called at some point), the compiler will deoptimize the code again and may revisit this path later using more profiling data.
Another explanation is as follows:
The process of converting a compiled (or more optimized) stack frame into an interpreted (or less optimized) stack frame. Also describes the discarding of an nmethod whose dependencies (or other assumptions) have been broken. Deoptimized nmethods are typically recompiled to adapt to changing application behavior. Example: A compiler initially assumes a reference value is never null, and tests for it using a trapping memory access. Later on, the application uses null values, and the method is deoptimized and recompiled to use an explicit test-and-branch idiom to detect such nulls.

