git clone https://github.com/jurobystricky/Netgear-A6210
cd /usr/src/netgear-a6210-2.5.0/
make
sudo make install
DKMS Install
On Debian-based distros, you can add the module to DKMS so it will automatically build and install on each successive kernel upgrade. To do this, issue the following commands from within the repo’s folder:
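A typical sequence (assuming the module name netgear-a6210 and version 2.5.0, matching the /usr/src path above) is:

sudo dkms add -m netgear-a6210 -v 2.5.0
sudo dkms build -m netgear-a6210 -v 2.5.0
sudo dkms install -m netgear-a6210 -v 2.5.0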
../../gdb/nat/x86-linux-dregs.c:146: internal-error: void x86_linux_update_debug_registers(lwp_info*): Assertion `lwp_is_stopped (lwp)' failed.
A problem internal to GDB has been detected,
On an Intel Core i7 5960X CPU, performance from GCC 5 to GCC 10 is essentially flat (under a 2% improvement), which suggests that general-purpose compiler optimization has barely advanced in recent years. That is a rather sad conclusion for compiler practitioners.

"An Intel Core i7 5960X Haswell-E system was used for testing rather than a newer CPU in order to rule out back-end/micro-architecture specific optimizations across the tested compilers. Intel Haswell has offered tuned GCC support since before the GCC 5 release. Ubuntu 19.10 was running on this Core i7 5960X system with the Linux 5.3 kernel."
https://www.phoronix.com/scan.php?page=article&item=gcc5-gcc10-benchmarks&num=4
make -C ../apps/cpp_rpc CXX=/home/majiang/hd/opensource/android_sdk/android-ndk-r21/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android24-clang++ TVM_RUNTIME_DIR=/home/majiang/hd/opensource/tvm/build_arm64/
adb shell
cd /data/local/tmp
export LD_LIBRARY_PATH=`pwd`
./tvm_rpc
If everything is working, you should see the help message, for example:
[10:17:53] main.cc:289: Command line usage
 server        - Start the server
--host         - The hostname of the server, Default=0.0.0.0
--port         - The port of the RPC, Default=9090
--port-end     - The end search port of the RPC, Default=9199
--tracker      - The RPC tracker address in host:port format e.g. 10.1.1.2:9190 Default=""
--key          - The key used to identify the device type in tracker. Default=""
--custom-addr  - Custom IP Address to Report to RPC Tracker. Default=""
--silent       - Whether to run in silent mode. Default=False
Start the RPC service and test it
After starting the C++ version of the RPC server, check that it works. First start the RPC tracker on the host with the following command; you should see a message like "INFO:RPCTracker:bind to 0.0.0.0:9190".
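Assuming a standard TVM install on the host, the usual tracker invocation is:

python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190

Once the device-side tvm_rpc has registered with the tracker (via its --tracker and --key options), the connected devices can be listed with:

python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190

which should print something like: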
Server List
----------------------------
server-address        key
----------------------------
192.168.3.33:38151    server:android
----------------------------
Queue Status
-------------------------------
key       total  free  pending
-------------------------------
android   1      1     0
-------------------------------
SUMMARY: AddressSanitizer: heap-buffer-overflow (/data/local/tmp/libclang_rt.asan-aarch64-android.so+0x85f4c)
Shadow bytes around the buggy address:
  0x001ed4dd8dd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x001ed4dd8de0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x001ed4dd8df0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x001ed4dd8e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x001ed4dd8e10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x001ed4dd8e20:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x001ed4dd8e30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x001ed4dd8e40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x001ed4dd8e50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x001ed4dd8e60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x001ed4dd8e70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==5713==ABORTING
void Reserve(size_t n) {
  if (ring_.size() < n) {
    // grow the ring buffer
  } else if (ring_.size() > n * 8 && ring_.size() > kInitCapacity &&
             bytes_available_ > 0) {
    // shrink too large temporary buffer to avoid out of memory
    // on some embedded devices
    size_t old_bytes = bytes_available_;
    std::vector<char> tmp(old_bytes);
    Read(&tmp[0], old_bytes);
    // ring_.resize(kInitCapacity); this may cause overflow when n > kInitCapacity
    ring_.resize(kInitCapacity > n ? kInitCapacity : n);
    ring_.shrink_to_fit();
    // copy the saved bytes back and reset the read position
    memcpy(&ring_[0], &tmp[0], old_bytes);
    head_ptr_ = 0;
    bytes_available_ = old_bytes;
  }
}
/* Enable custom logging - this will cause TVM to pass every log message * through CustomLogMessage instead of LogMessage. By enabling this, we must * implement dmlc::CustomLogMessage::Log. We use this to pass TVM log * messages to Android logcat. */ #define DMLC_LOG_CUSTOMIZE 1
/* Ensure that fatal errors are passed to the logger before throwing * in LogMessageFatal */ #define DMLC_LOG_BEFORE_THROW 1
#include <android/log.h>
void dmlc::CustomLogMessage::Log(const std::string& msg) { // This is called for every message logged by TVM. // We pass the message to logcat. __android_log_write(ANDROID_LOG_DEBUG, "TVM_RUNTIME", msg.c_str()); }
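Note that __android_log_write comes from the NDK's liblog, so the Android build must also link against it (e.g. add -llog to the linker flags).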
make -C ../apps/cpp_rpc CXX=/mnt/d/opensource/android_ndk/android-ndk-r21b/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android28-clang++ TVM_RUNTIME_DIR=/mnt/d/opensource/opensrc_tvm/tvm/build_arm64
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__ self._handle = _dlopen(self._name, mode) OSError: /home/majiang/opensrc/tvm/build/libtvm.so: undefined symbol: spvContextDestroy Uncaught exception. Entering post mortem debugging Running 'cont' or 'step' will restart the program
Important Features
- Interactive kernel profiler and API debugger
- Graphical profile report
- Result comparison across one or multiple reports within the tool
- Fast Data Collection
- UI and Command Line interface
- Fully customizable reports and analysis rules
The major challenge is to make the schedule space complete enough to cover the state-of-the-art kernels for the hardware back-ends we want to support, specifically GPU and other hardware. The second challenge is to build the DSL representation to cover the things we care about in deep learning (e.g. recurrence). The other issues include ease of deployment and interoperation.
These challenges are not well addressed by existing frameworks (including Halide) and require a rethink and redesign of the stack, as opposed to simply reusing an existing one.
You can also see that TVM's IR itself is evolving, and we continuously learn new lessons from hand optimization and tuning for various backends.
I think what you described reflects MLIR's vision: make an abstract class of IR and derive dialects from it, but not necessarily provide specific passes for each dialect. So if X-IR is a dialect of MLIR, dialect-specific passes are still needed on top of it.
The polyhedral dialect is one dialect in MLIR. At present the polyhedral IR is part of the MLIR codebase, which gives it a "native" look, but it is nonetheless a dialect just like any other automatic-optimization dialect. The fact that it is part of the native codebase does give an opinionated view of what automatic optimization should look like in the MLIR ecosystem. I think it is still very much an open problem; TVM has done a lot in this direction, and we can collectively innovate in this area.

How TVM can work with MLIR
First of all, MLIR won't make TVM obsolete. On the contrary, it can help the TVM stack by providing insights into IR design and possibly some lowering infrastructure. The community will keep improving our current IR infrastructure toward a better unified TVM-IR infra. We will try to define TVM dialects in MLIR to see whether it makes sense to allow bi-directional translation between MLIR and TVM-IR; this way we can benefit from some of the infrastructure MLIR provides and make TVM work with MLIR's ecosystem.
from tvm.contrib.debugger import debug_runtime as graph_runtime
The implementation can be found in the RunIndividual function in src/runtime/graph/debug/graph_runtime_debug.cc; at its core it simply times every op. The valuable output falls into two categories:
(a) per-op run times printed in execution order in the background (with a notebook, this appears in the terminal that launched the notebook rather than in the browser);
(b) per-op times sorted by share of total run time, printed in the notebook.
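A minimal usage sketch (graph, lib, ctx, and params are assumed to come from a normal build; the debug runtime mirrors the regular graph runtime API):

# hypothetical names: graph/lib/ctx/params come from an earlier build step
m = graph_runtime.create(graph, lib, ctx)  # same create() signature as the normal runtime
m.set_input(**params)
m.run()  # additionally records and prints the per-op timing described above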
Getting the IR after each pass runs
Using git bisect to locate a problem
In the tvm directory: run git bisect start to begin the binary search, then use git bisect good $commit_id and git bisect bad ${commit_id} to set the search range. After that, repeatedly build and run tvm with cd build; cmake ../ -DCMAKE_BUILD_TYPE=Debug; make -j 6 and check whether the behavior is correct. If it is correct, run git bisect good; if it misbehaves, git bisect bad; if an intermediate revision hits an unrelated problem (e.g. another interfering bug), use git bisect skip. Once the offending commit is found, run git bisect reset to restore the tree.
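A sketch of a typical session (the commit ids shown are placeholders):

git bisect start
git bisect bad HEAD                # the current revision misbehaves
git bisect good v0.6               # a revision known to work (placeholder)
cd build; cmake ../ -DCMAKE_BUILD_TYPE=Debug; make -j 6; cd ..
# run the failing scenario, then record the verdict:
git bisect good                    # or: git bisect bad / git bisect skip
# ...repeat build-and-test until git prints the first bad commit...
git bisect reset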
// This example implements a simple matrix computation with Halide
int main(int argc, char **argv) {
    /* Three core concepts are declared here:
       Func - a collection of operations (Exprs)
       Var  - a variable involved in the operations
       Expr - a single operation
    */
    Halide::Func gradient;
    Halide::Var x, y;
    Halide::Expr e = x + y;
    // Only here is the function f(x, y) = x + y actually defined
    gradient(x, y) = e;
    /* The declarative definitions above already show that Halide is really a
       new language. A few details worth noting:
       1. We never assign values to Var x and y, yet we use them directly in
          the Expr. This works because they merely name the two axes of a 2-D
          array; they do not stand for concrete values.
       2. In `Halide::Expr e = x + y;`, the '=' and '+' clearly do not have
          their usual semantics. This line actually builds a Halide Expr IR
          node whose op is '+', with x as the LHS and y as the RHS.
       3. `gradient(x, y) = e;` attaches the Expr to the function, again an
          operation on the IR.
    */
    // Calling realize performs compilation and execution and yields the result
    Halide::Buffer<int32_t> output = gradient.realize(800, 600);
    // The rest merely checks that Halide matches the ordinary computation
    for (int j = 0; j < output.height(); j++) {
        for (int i = 0; i < output.width(); i++) {
            if (output(i, j) != i + j) {
                printf("Something went wrong!\n"
                       "Pixel %d, %d was supposed to be %d, but instead it's %d\n",
                       i, j, i + j, output(i, j));
                return -1;
            }
        }
    }
    printf("Success!\n");
    return 0;
}
Stepping through the lower function, you can see it does roughly two main jobs: first, it fills in the machinery the final program needs, such as initializing the environment, building the loop nests, inserting various pieces, and so on; second, it performs the high-level optimizations (the closer an optimization stays to the source level, the simpler it is to carry out). But even at the end of lower there is still no translation to another IR or to machine instructions. Returning from lower and continuing to step through compile_jit, you eventually reach the backtrace below, where Halide performs the codegen from its own IR to LLVM IR (of course, combining this with source analysis and searching for the LLVM-related code paths would get you here faster):
#0  Halide::Internal::CodeGen_LLVM::compile (this=0x5555557b60e0, input=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/CodeGen_LLVM.cpp:637
#1  0x00007ffff399a42c in Halide::codegen_llvm (module=…, context=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/CodeGen_LLVM.cpp:46
#2  0x00007ffff3c326e1 in Halide::compile_module_to_llvm_module (module=…, context=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/LLVM_Output.cpp:381
#3  0x00007ffff3c13c91 in Halide::Internal::JITModule::JITModule (this=0x7fffffffbf90, m=…, fn=…, dependencies=std::vector of length 0, capacity 0) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/JITModule.cpp:251
#4  0x00007ffff3cb86dd in Halide::Pipeline::compile_jit (this=0x7fffffffd920, target_arg=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/Pipeline.cpp:607
#5  0x00007ffff3cbbda7 in Halide::Pipeline::realize (this=0x7fffffffd920, outputs=…, t=…, param_map=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/Pipeline.cpp:1099
#6  0x00007ffff3cb98b0 in Halide::Pipeline::realize (this=0x7fffffffd920, sizes=std::vector of length 2, capacity 2 = {…}, target=…, param_map=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/Pipeline.cpp:703
#7  0x00007ffff3ac078c in Halide::Func::realize (this=0x7fffffffdbf0, sizes=std::vector of length 0, capacity 0, target=…, param_map=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/Func.cpp:2922
#8  0x00007ffff3ac0a7d in Halide::Func::realize (this=0x7fffffffdbf0, x_size=800, y_size=600, target=…, param_map=…) at /media/majiang/c6b38ac3-8b8a-4613-8259-dddbffe2f4cb/majiang/opensource/Halide/src/Func.cpp:2937
#9  0x000055555555d56a in main (argc=1, argv=0x7fffffffdd98) at lesson_01_basics.cpp:78
The remaining flow is more direct. CodeGen_LLVM.cpp contains the bulk of the translation: in compile_func, the call f.body.accept(this); kicks off LLVM IR emission, and the set of visit functions in CodeGen_LLVM.cpp then generate LLVM IR for each kind of Halide IR node.
Typically, the base class template will take advantage of the fact that member function bodies (definitions) are not instantiated until long after their declarations, and will use members of the derived class within its own member functions, via the use of a cast; e.g.:
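A minimal sketch of the pattern (the names Base, Derived, interface, and implementation are illustrative):

template <class Derived>
struct Base {
    void interface() {
        // Derived is still incomplete when Base<Derived> is declared, but this
        // body is only instantiated once Derived is complete, so the cast is valid.
        static_cast<Derived*>(this)->implementation();
    }
};

struct Derived : Base<Derived> {
    void implementation() { /* derived-specific behavior */ }
};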