2020-03-16

NVIDIA GPU compiler

2020-03-13

Kaleidoscope conclusion

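Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl10.html.

Exercise

Summary of what still needs improvement in this implementation:

Basic functionality:
1 Add more test cases (lexer, loc, etc.) and new system tests covering more complex scenarios (test/Examples/Kaleidoscope/Chapter4.test in the LLVM source tree can be used as a direct reference)
2 Eliminate the redundant filename stored in every loc
3 Add an AST dump feature
4 Improve the use of smart pointers: can a shared_ptr be converted back to a unique_ptr? Can automatic type conversion be implemented to avoid the many get() calls?
5 Add a simple command-line argument parsing flow and improve the help message
6 Keep a line buffer so that the offending line can be printed when a parse error occurs, to help debugging
7 The using namespace directives in header files can pollute other code and need to be removed.
8 Debug info bugfixes: variables in var/for are not yet declared, and the info generated for operator = is still wrong (fixed)
9 Try hooking up the Maple IR of the Ark compiler, and try implementing the Visitor pattern
…
Report the serious problems in the tutorial, such as the semantics of for and the incorrect debug info, by email.

New and interesting extensions

1 Add global variables
2 Add a type system (typed variables)
3 Add support for arrays, structs and vectors, as practice with the LLVM getelementptr instruction
4 Implement supporting runtime features, for example I/O?
5 Memory management
6 Exception handling support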

2020-03-10

Kaleidoscope debug info

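Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl09.html.

Exercise

Understand the original page, then implement the same functionality.

Extensions

1 When generating debug info, the tutorial says the ABI of the Kaleidoscope language is close to the C ABI. Where can this be confirmed?

2 Is the implementation of source_loc reasonable? Is there room for improvement?
How should the string be stored?

3 Can the operators in core_lib be fed to the parser first, with the user's input file parsed afterwards? By merging the ASTs before codegen, it should be possible to add the language-extension operators silently.

4 Do the variables declared in var get correct debug info?

Progress

To reduce the workload, dropped the AST dump feature that the original example quietly added (the tutorial text does not describe it, but the code contains it).
Added debug info generation and fixed a logic error in the original implementation.

Implementation

The main changes in the implementation are as follows:

1 To save work, the AST dump feature of the original example was removed and has not been implemented yet;

2
The way the original implementation emits debug locations is flawed and gives some instructions the wrong debug info.
As the binary op codegen below shows, the original example calls emitLocation at the top of the function, and that location is immediately overwritten by the LHS/RHS codegen that follows (they call emitLocation too). As a result, the CreateFAdd instruction that really belongs to the operator ends up with the location set by the last emitLocation call (that is, the location of the RHS).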

Value *BinaryExprAST::codegen() {
  KSDbgInfo.emitLocation(this);
......
  Value *L = LHS->codegen();
  Value *R = RHS->codegen();
  if (!L || !R)
    return nullptr;

  switch (Op) {
  case '+':
    return Builder.CreateFAdd(L, R, "addtmp");
...
  }
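
The ordering problem can be reproduced without LLVM. The sketch below is a toy model, not the tutorial's or this implementation's code: ToyBuilder, Leaf and the codegen_* functions are hypothetical stand-ins, and the only assumption is that every emitted instruction is stamped with whichever location was set last.

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an IR builder: each emitted instruction is stamped with
// whatever source line is "current" at the moment it is emitted.
struct ToyBuilder {
    int current_line = 0;
    std::vector<std::pair<std::string, int>> emitted;  // (instruction, line)

    void emit_location(int line) { current_line = line; }
    void emit(const std::string& inst) { emitted.emplace_back(inst, current_line); }
};

// Hypothetical leaf expression located on `line`.
struct Leaf {
    int line;
};

void codegen_leaf(ToyBuilder& b, const Leaf& leaf) {
    b.emit_location(leaf.line);
    b.emit("load");
}

// Tutorial ordering: the operator's location is set first and then clobbered
// by the operand codegen, so "add" inherits the line of the RHS.
void codegen_binary_early(ToyBuilder& b, int op_line, const Leaf& lhs, const Leaf& rhs) {
    b.emit_location(op_line);
    codegen_leaf(b, lhs);
    codegen_leaf(b, rhs);
    b.emit("add");
}

// Reordered: the operator's location is set right before its own instruction.
void codegen_binary_late(ToyBuilder& b, int op_line, const Leaf& lhs, const Leaf& rhs) {
    codegen_leaf(b, lhs);
    codegen_leaf(b, rhs);
    b.emit_location(op_line);
    b.emit("add");
}

// Line recorded for the most recently emitted instruction.
int last_line(const ToyBuilder& b) { return b.emitted.back().second; }
```

With an operator on line 4 whose RHS sits on line 5, the early variant stamps the add with line 5 and the late variant with line 4. Re-emitting the location after the operands are generated is one plausible way to get the corrected output shown later in this entry.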

The example code shipped with LLVM can be built as follows.

cd llvm-project/llvm/examples/Kaleidoscope/Chapter9
 g++ toy.cpp -I ../include/ -I ../../../../../install/usr/local/include/ -L../../../../../install/usr/local/lib `../../../../../install/usr/local/bin/llvm-config --libs` -pthread -ldl  -lz -ltinfo -fno-rtti -o toy

Then test the toy program with the following code (press Ctrl+D after entering it to end the program).

def binary , 1 (left  right) right

extern kout(x)
def te(y)
kout(y),
y=5,
kout(y)

def main()
	te(2)

This produces the following LLVM IR.

define double @te(double %y) !dbg !13 {
entry:
  %y1 = alloca double
  call void @llvm.dbg.declare(metadata double* %y1, metadata !17, metadata !DIExpression()), !dbg !18
  store double %y, double* %y1
  %y2 = load double, double* %y1, !dbg !19
  %calltmp = call double @kout(double %y2), !dbg !19
  store double 5.000000e+00, double* %y1, !dbg !20
  %binop = call double @"binary,"(double %calltmp, double 5.000000e+00), !dbg !20
  %y3 = load double, double* %y1, !dbg !21
  %calltmp4 = call double @kout(double %y3), !dbg !21
  %binop5 = call double @"binary,"(double %binop, double %calltmp4), !dbg !21
  ret double %binop5, !dbg !21
}
...
!13 = distinct !DISubprogram(name: "te", scope: !2, 
!20 = !DILocation(line: 6, column: 3, scope: !13)
!21 = !DILocation(line: 7, column: 6, scope: !13)

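For both ',' operators in function te, the line number points at the position of the RHS, which does not match the source code at all.
Next, compile the code with our implementation (',' is built in there, so its definition is removed).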

extern kout(x)

def te(y)
kout(y),
y=5,
kout(y)

def main()
	te(2)

The output is as follows.

define double @te(double %y) !dbg !3 {
entry:
  %y1 = alloca double
  store double %y, double* %y1, !dbg !9
  call void @llvm.dbg.declare(metadata double* %y1, metadata !8, metadata !DIExpression()), !dbg !10
  %y2 = load double, double* %y1, !dbg !11
  %callkout = call double @kout(double %y2), !dbg !12
  store double 5.000000e+00, double* %y1, !dbg !13
  %"_binary_,_with_prio_1" = call double @"_binary_,_with_prio_1"(double %callkout, double 5.000000e+00), !dbg !12
  %y3 = load double, double* %y1, !dbg !14
  %callkout4 = call double @kout(double %y3), !dbg !15
  %"_binary_,_with_prio_15" = call double @"_binary_,_with_prio_1"(double %"_binary_,_with_prio_1", double %callkout4), !dbg !13
  ret double %"_binary_,_with_prio_15", !dbg !13
}
...
!3 = distinct !DISubprogram(name: "te", scope: !1, file: !1, line: 3, type: !4, scopeLine: 3, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !0, retainedNodes: !7)
...
!12 = !DILocation(line: 4, column: 8, scope: !3)
!13 = !DILocation(line: 5, column: 4, scope: !3)

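After the fix, the location of ',' matches the source code.
%"_binary_,_with_prio_1" = call double @"_binary_,_with_prio_1"(double %callkout, double 5.000000e+00), !dbg !12 shows that the location of the first ',' is given by !dbg !12.
And !12 = !DILocation(line: 4, column: 8, scope: !3) places that ',' exactly at line 4, column 8 of the source (following scope: !3 shows that it belongs to function te).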

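3 Tokens now carry location info as well. The AST takes its locations from the tokens rather than talking to the lexer directly, which keeps the layering cleaner.

4 Added control flow and switch variables for debug info emission

5
When emitting the IR of a function, our implementation splits debug info emission into two parts
to make it easier to control. The debug info for args is emitted after the store instructions for the args.
This causes verifyFunction to fail with the error below.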

Expected no forward declarations!
!6 = <temporary!> !{}
  store double %x, double* %x1, !dbg !7
  store double %y, double* %y2, !dbg !7
  call void @llvm.dbg.declare(metadata double* %x1, metadata !8, metadata !DIExpression()), !dbg !9
  call void @llvm.dbg.declare(metadata double* %y2, metadata !10, metadata !DIExpression()), !dbg !9

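It can be reproduced with def foo (x y) x+y.
At first it looked as if dbg.declare had to be placed before the corresponding store instruction.
After consulting https://stackoverflow.com/questions/34236034/how-to-track-down-llvm-verifyfunction-error-expected-no-forward-declarations/60656058#60656058,
adding a call to finalizeSubprogram before verifyFunction corrects the error.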

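Extension questions

1 When generating debug info, the tutorial says the ABI of the Kaleidoscope language is close to the C ABI. Where can this be confirmed?
The language currently has no explicitly designed ABI, and nothing of the sort needs to be configured while generating LLVM IR. When actual code has to be generated, LLVM falls back to default values. For a function's calling convention, the setCallingConv method can be used to set it explicitly. Tracing that setter shows that the initial value should be 0.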

  void setCallingConv(CallingConv::ID CC) {
    auto ID = static_cast<unsigned>(CC);
    assert(!(ID & ~CallingConv::MaxID) && "Unsupported calling convention");
    setValueSubclassData((getSubclassDataFromValue() & 0xc00f) | (ID << 4));
  }
--->
    void setValueSubclassData(unsigned short D) {
    Value::setValueSubclassData(D);
  }
--->
    void setValueSubclassData(unsigned short D) { SubclassData = D; }
--->
  /// Hold arbitrary subclass data.
  ///
  /// This member is defined by this class, but is not used for anything.
  /// Subclasses can use it to hold whatever state they find useful.  This
  /// field is initialized to zero by the ctor.
  unsigned short SubclassData;

The meaning of the value 0 can be found in llvm/IR/CallingConv.h

/// A set of enums which specify the assigned numeric values for known llvm
/// calling conventions.
/// LLVM Calling Convention Representation
enum {
  /// C - The default llvm calling convention, compatible with C.  This
  /// convention is the only calling convention that supports varargs calls.
  /// As with typical C calling conventions, the callee/caller have to
  /// tolerate certain amounts of prototype mismatch.
  C = 0,

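So the tutorial's statement looks basically correct. When no ABI is configured, LLVM appears to use the C settings as the default.

2 Is the implementation of source_loc reasonable? Is there room for improvement?
For simplicity, source_loc currently carries a lot of redundant information.
At the very least, the many duplicated filename strings should be merged into one and replaced by an idx into a vector.
The eventual design will probably follow mature compilers such as gcc and make source_loc itself a single idx, assembling the full information only when it is needed. Internally, redundant strings can be merged, and line and column numbers can be stored with a compressed encoding (for example a base value plus an offset).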
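
The first step of that plan (an idx into a vector instead of a filename string per location) might look like the sketch below. SourceLoc, FileTable and to_string are hypothetical names chosen for illustration, not the names used in the implementation, and the compressed line/column encoding is not attempted here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical compact location: the file name is replaced by an index into
// a shared table, so a location is three 32-bit integers.
struct SourceLoc {
    std::uint32_t file_idx;
    std::uint32_t line;
    std::uint32_t column;
};

class FileTable {
public:
    // Returns the index of `name`, adding it to the table on first use.
    std::uint32_t intern(std::string_view name) {
        auto it = index_.find(std::string(name));
        if (it != index_.end())
            return it->second;
        auto idx = static_cast<std::uint32_t>(names_.size());
        names_.emplace_back(name);
        index_.emplace(names_.back(), idx);
        return idx;
    }

    const std::string& name(std::uint32_t idx) const { return names_.at(idx); }
    std::size_t size() const { return names_.size(); }

private:
    std::vector<std::string> names_;                        // idx -> file name
    std::unordered_map<std::string, std::uint32_t> index_;  // file name -> idx
};

// The human-readable form is assembled only when it is actually needed.
std::string to_string(const FileTable& files, const SourceLoc& loc) {
    return files.name(loc.file_idx) + ":" + std::to_string(loc.line) + ":" +
           std::to_string(loc.column);
}
```

Every location from the same file then shares one stored string, and the table is the only place that owns file names.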

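3 Can the operators in core_lib be fed to the parser first, with the user's input file parsed afterwards? By merging the ASTs before codegen, it should be possible to add the language-extension operators silently.
Yes, but not by merging ASTs.
User-defined operators currently need support from the lexer. When the real code is parsed, the lexer does not know which operators have been added, so unknown tokens appear.
The current solution is to have the parser silently import extern declarations of the custom operators, so that users can use these extension operators just like built-in ones. The implementations of the custom operators live in core_support_lib, and the wrapper script links them in.
The main problem with the current solution is that many operators are short statements that should be inlined, but once they are split out into a library only LTO can achieve that.
A possible later improvement is to feed the def definitions directly into the parser and resolve the debug info conflict by modifying the lexer's loc info directly (or by turning debug info output off while generating the operator definitions and then running codegen on the rest).

4 Do the variables declared in var get correct debug info?
No. In the original example, the variables in var and for are never declared, so they have no debug info. They need to be added one by one, following the handling of args.
Given the workload, this implementation has not added that debug info yet either.

Problems encountered during implementation

1 A variable defined inside the body of a range-based for loop does not keep its value across iterations. Take the following code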

#include <iostream>
using namespace std;
int main()
{
        for (auto x : {1,2,3})
                {int i = 0; cout << i++ << ":" << x <<endl;}
        return 0;
}

The output is

0:1
0:2
0:3

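rather than a counter that advances with each iteration, as I had expected. No warning is issued even with -Wall -Wextra, because this is well-defined behavior: i is declared inside the loop body, so it is created and initialized to 0 again on every iteration.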
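
A minimal sketch of the difference, with the output collected into a string so the two variants can be compared. The function names are made up for illustration.

```cpp
#include <initializer_list>
#include <sstream>
#include <string>

// Counter declared in the loop body: it is re-created and reset to 0 on
// every iteration, exactly as in the snippet above.
std::string counter_inside_body() {
    std::ostringstream out;
    for (auto x : {1, 2, 3}) {
        int i = 0;
        out << i++ << ":" << x << "\n";
    }
    return out.str();
}

// Counter declared before the loop: it lives across iterations.
std::string counter_outside_loop() {
    std::ostringstream out;
    int i = 0;
    for (auto x : {1, 2, 3}) {
        out << i++ << ":" << x << "\n";
    }
    return out.str();
}
```

Since C++20 the counter can also be declared in the loop header itself (for (int i = 0; auto x : {1, 2, 3})), which keeps it scoped to the loop while still preserving its value between iterations.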

2
During implementation I tested once more how far using namespace std; reaches when it appears in a header

#include <iostream>
namespace xx{using namespace std;}
namespace xx{void tt() {string x;}}
namespace xx1{void tt() {string x;}}

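This produces the error below. It shows that a using directive placed inside a namespace in a header leaks into every other block of that same namespace (xx here), while other namespaces (xx1) are unaffected. Limiting the scope of using namespace is therefore still worthwhile.

tt.cpp:4:26: error: ‘string’ was not declared in this scope
 namespace xx1{void tt() {string x;}}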

2020-03-10

Kaleidoscope object generation

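Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl08.html.

Exercise

Understand the original page, then implement the same functionality.

Extensions

1 The original example compiles for a generic machine model. Can finer-grained optimization for the local machine be implemented?

2 Can cross compilation be supported at low cost?

Progress

Completed the functionality of the original example.
New feature: fine-grained optimization for the native machine (enabling the features supported by the local CPU).
New feature: option control based on environment variables. Object files are generated by default, and an option can be used to keep the intermediate LLVM IR files.

Implementation

The two main changes:
1
The original example generates code for the features of a generic CPU, so optimization does not take full advantage of the local CPU. This implementation instead probes the capabilities of the native CPU and selects the most suitable instructions (equivalent to using -march=native). The CPU is detected with the getCPUStr function from llvm/CodeGen/CommandFlags.inc. However, that .inc file does not seem to be a stable public interface: including it more than once in a project causes conflicts at startup, and the way it sets internal variables is rather crude. The implementation works around this by isolating it in a single .cpp file, which may not be the best solution.

2
With object file generation added, the default output was changed from LLVM IR to an object file, following the usual compiler convention. At the same time, being able to inspect the LLVM IR produced during compilation is very helpful for debugging the compiler's own logic.
To support this kind of control, an option framework based on environment variables was added. The code is as follows.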

class control_flags
{
	template <typename T>
	struct flag_item
	{
		T flag_val;
		const char* input_env;
		const char* description;
		flag_item(T default_val, const char* env, const char* des) :
			input_env(env), description(des)
		{
			const char *env_val = getenv(input_env);
			if (env_val != nullptr)
			{
				stringstream tmp;
				tmp << env_val;
				tmp >> flag_val;
			}
			else
				flag_val = default_val;
		}
		flag_item(){}
		operator T() {return flag_val;}
	};

public:
#define DECL_FLAG(flag_type, flag_name, default_val, input_env, des) \
	flag_item<flag_type> flag_name;
#include "flags.def"
#undef DECL_FLAG
	control_flags()
	{
#define DECL_FLAG(flag_type, flag_name, default_val, input_env, des) \
	flag_item<flag_type> flag_name##cons(default_val, input_env, des);\
	flag_name = flag_name##cons;
#include "flags.def"
#undef DECL_FLAG
	}
}global_flags;

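The basic idea is simple and borrowed from the practice of the golang compiler. Input comes from environment variables, which avoids tedious command-line option parsing. The generic type conversion provided by sstream then handles setting the flag value in most cases. Later, callback functions can be added as needed to validate the input.
With this framework, adding a control option takes just one new line like the one below. To use the flags, define a single global control_flags object (global_flags in the code above) and refer to global_flags.save_temps and so on.

// type, name, default value, name of the controlling environment variable, description
DECL_FLAG(bool, save_temps, false, "save_temps", "keep intermediate files")

Extension questions

1 The original example compiles for a generic machine model. Can finer-grained optimization for the local machine be implemented?
The mechanisms provided by LLVM can detect the capabilities of the local CPU automatically; see item 1 of the implementation section for details.

2 Can cross compilation be supported at low cost?
In the LLVM framework cross compilation is the default setup, and native compilation is just a special case. Supporting cross compilation is therefore easy: only one new argument is needed to specify the target triple (TargetTriple, for example x86_64-linux-gnu). However, adding this feature also means adding correctness checks and changing the logic in the optimizer, which is somewhat more work. It can wait until the main work is done.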

2020-03-02

Kaleidoscope mutable variable

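Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl07.html.

Exercise

Understand the original page, then implement the same functionality.

Extensions

1 Does using CreateAlloca to create stack slots at the top of the function lead to unnecessary extra stack usage?

Progress

Completed the functionality of the original example and added corresponding simple tests.
Added core operators such as ',' and '!=' as a library; they are not yet added to the source code automatically.
One simple approach is to include the defs of these operators as source at the top of the code and then compile. However, the effect this has on source line numbers needs to be considered.
This will be handled together with the later chapter on debug info generation.

Implementation

Two main improvements were made in the implementation:
1 Variables in var default to 0. The original example builds this default during LLVM IR generation. Logically it is a rule of the language's syntax and should not be decided in the codegen flow. In this implementation, parse_var in the parser sets the value of an uninitialized variable to 0, and codegen simply generates code for the value it is given.
2 Handling variables in var that shadow previously defined variables. For simplicity, the original example buffers every variable declared in var in OldBindings, storing nullptr when nothing is shadowed. After the body is generated, every entry in OldBindings is written back to NamedValues, as the following fragment shows.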

  std::vector<AllocaInst *> OldBindings;
  OldBindings.push_back(NamedValues[VarName]);
  // Remember this binding.
  NamedValues[VarName] = Alloca;
....body_gen....
  // Pop all our variables from scope.
  for (unsigned i = 0, e = VarNames.size(); i != e; ++i)
    NamedValues[VarNames[i].first] = OldBindings[i];

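This implementation uses the safer find instead of operator[] and also improves the structure of the saved state.
By storing the address of the alloca field of the shadowed variable in named_var, the map lookup when restoring
named_var is eliminated, as the following fragment shows.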

	vector<std::pair<AllocaInst **, AllocaInst *>> saved_name_vec;
....
		if (auto it = named_var.find(var_name); it != named_var.end())
		{
			saved_name_vec.push_back(
				std::make_pair(&(it->second), it->second));
			it->second = var_allocas[i];
		}
		else
			named_var[var_name] = var_allocas[i];
	}
...bodygen....
// restore named_var
	for (size_t i = 0; i < saved_name_vec.size(); ++i)
	{
		auto old_alloca_addr = saved_name_vec[i].first;
		*old_alloca_addr = saved_name_vec[i].second;
	}

3 Testing the implementation revealed a possible memory leak in the original example, as the following code shows.

Value *IfExprAST::codegen() {
  ....
  // Create blocks for the then and else cases.  Insert the 'then' block at the
  // end of the function.
  BasicBlock *ThenBB = BasicBlock::Create(TheContext, "then", TheFunction);
  BasicBlock *ElseBB = BasicBlock::Create(TheContext, "else");
  BasicBlock *MergeBB = BasicBlock::Create(TheContext, "ifcont");
....
  Value *ThenV = Then->codegen();
  if (!ThenV)
    return nullptr;

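In the fragment above, ElseBB and MergeBB are not linked into the function's block list right after they are created. If the function returns on an error path such as the ThenV check, nobody can free these two pointers any more. Our implementation has a similar problem.
A reasonably complete fix has to consider two points together:
First, does linking every BB into the Function immediately after creation have any ill effect? If not, linking at creation is the best solution;
If they cannot be linked immediately, then besides freeing the resources inside the function, statements such as ir_builder.CreateBr(merge_bb); may end up referencing dangling pointers when an error occurs. The references to these pointers need to be reordered so that they all come at the end.

Extension questions

1 Does using CreateAlloca to create stack slots at the top of the function lead to unnecessary extra stack usage?
Tested with the C fragment below. clang also places all allocas at the top when generating code, and the generated x86-64 and mips64 code likewise reserves the whole frame by adjusting sp once at function entry.
The trade-off is probably that growing the stack dynamically saves little memory while wasting instructions on sp manipulation, and it also makes stack frame analysis harder, so it is not worth it.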

extern int x ;
void test()
{
        char buf[1024] = {2};
        if (x)
        {
                char nb[1024]= {1};
        }
}

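Problems encountered during implementation

1 operator[] of map modifies the map
While rewriting the part of the for codegen that records and restores the idt var in the named_var map,
I noticed the following piece of code.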

	auto old_val = named_var[idt_name];
	named_var[idt_name] = idt_var;
.....
	if (old_val != nullptr)
		named_var[idt_name] = old_val;
	else
		named_var.erase(idt_name);

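This logic was copied directly from the original example.
During refactoring, the question of the initial value of named_var[idt_name] came up.
When the key idt_name is not in the map, the meaning of reading its value is rather unclear.
According to https://en.cppreference.com/w/cpp/container/map/operator_at , operator[] silently performs an insert, even when it is only used to read a value.
The following example shows this surprising behavior quite directly.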

#include <map>
#include <iostream>
using namespace std;
int main()
{
	map<int, int*> t1;
	cout << "size before access:" << t1.size() << endl;
	cout << "uninitialized pointer is: " << t1[0] << endl;
	cout << "size after access:" << t1.size() << endl;
	cout << "count after access:" << t1.count(0) << endl;
	return 0;
}

The program prints the following.

size before access:0
uninitialized pointer is: 0
size after access:1
count after access:1

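Accessing a map with [] does perform an insert, and when the key is not in the map the returned value is value-initialized (a null pointer here, despite the "uninitialized" label in the test output).
Under these semantics the code copied from the original example does not cause a serious logic error, but two problems remain.
First, it introduces a redundant insert, which slows the compiler down.
Second, any cleanup path that does not erase a key that was initially absent (the var handling in the original example, for instance, writes OldBindings back unconditionally) leaves a bogus idt_var -> nullptr entry in the map. Later code that tests for existence with an interface such as count will then get the wrong answer. This is a latent source of bugs.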
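
A save/restore pair that avoids both problems could be sketched as follows. SymbolTable, bind and unbind are hypothetical names, and int* merely stands in for AllocaInst* so that the example compiles without LLVM.

```cpp
#include <map>
#include <optional>
#include <string>

// int* stands in for AllocaInst* in this sketch.
using SymbolTable = std::map<std::string, int*>;

// Bind `name` to `value`. Returns the previous binding, or nullopt when the
// name was not in the table. find() never inserts, unlike operator[].
std::optional<int*> bind(SymbolTable& table, const std::string& name, int* value) {
    auto it = table.find(name);
    if (it == table.end()) {
        table.emplace(name, value);
        return std::nullopt;
    }
    int* old = it->second;
    it->second = value;
    return old;
}

// Undo a bind(): restore the previous binding, or erase the name when there
// was none, so that no name -> nullptr entry is left behind.
void unbind(SymbolTable& table, const std::string& name, std::optional<int*> saved) {
    if (saved)
        table.insert_or_assign(name, *saved);
    else
        table.erase(name);
}
```

Using std::optional rather than a null pointer as the "was absent" marker keeps the two cases distinct even if a null binding were ever stored deliberately.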

2 std::move has no effect on a const vector
Consider the following test code

#include <iostream>
#include <vector>
using namespace std;
class testc_ref
{
public:
    vector<int> inner;
    testc_ref(vector<int>& in) : inner(std::move(in)) {}
};

class testc_move
{
public:
    vector<int> inner;
    testc_move(vector<int> in) : inner(std::move(in)) {}
};

class testc_const_ref
{
public:
    vector<int> inner;
    testc_const_ref(const vector<int>& in) : inner(std::move(in)) {}
};

int main()
{
    vector<int> mytest(100000, 1);
    cout << "org:" << &(mytest[0]) << endl;
    testc_move t1(mytest);
    cout << "t1:"<< &(t1.inner[0]) << endl;
    cout << &(mytest[0]) << endl;

    testc_move t2(std::move(mytest));
    cout << "t2:"<< &(t2.inner[0]) << endl;
    cout << &(mytest[0]) << endl;


    vector<int> mytest1(100000, 1);
    cout << "org:" << &(mytest1[0]) << endl;
    testc_ref t3(mytest1);
    cout << "t3:"<< &(t3.inner[0]) << endl;
    cout << &(mytest1[0]) << endl;

    vector<int> mytest2(100000, 1);
    cout << "org:" << &(mytest2[0]) << endl;
    testc_const_ref t4(mytest2);
    cout << "t4:"<< &(t4.inner[0]) << endl;
    cout << &(mytest2[0]) << endl;

    return 0;
}

The program above prints the following.

org:0x7f39d2d5f010
t1:0x7f39d2cfd010
0x7f39d2d5f010
t2:0x7f39d2d5f010
0
org:0x7f39d2c9b010
t3:0x7f39d2c9b010
0
org:0x7f39d2c39010
t4:0x7f39d1e45010
0x7f39d2c39010

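With pass-by-value, std::move is needed both at the call site and inside the constructor to avoid a memory allocation.
With pass-by-reference, a move inside the callee is enough.
With a const reference, std::move has no effect (the vector is copied), and no warning is issued.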
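
The reason for the last case can be checked at compile time. The sketch below uses only the standard library; the two helper functions are illustrative names.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// std::move is only a cast. Applied to a const object it yields `const T&&`,
// which cannot bind to the move constructor's `T&&` parameter, so overload
// resolution quietly selects the copy constructor instead.
static_assert(std::is_same_v<
              decltype(std::move(std::declval<const std::vector<int>&>())),
              const std::vector<int>&&>);
static_assert(std::is_same_v<
              decltype(std::move(std::declval<std::vector<int>&>())),
              std::vector<int>&&>);

// Returns true when constructing from std::move(src) took over src's buffer.
bool steals_buffer_from_const(const std::vector<int>& src) {
    const int* before = src.data();
    std::vector<int> dst(std::move(src));  // copies, because src is const
    return dst.data() == before;
}

bool steals_buffer_from_mutable(std::vector<int>& src) {
    const int* before = src.data();
    std::vector<int> dst(std::move(src));  // moves
    return dst.data() == before;
}
```

The const variant leaves the source untouched and allocates a new buffer, which matches the t4 line of the output above.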

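3 How to use a string_view as the key to look up a map keyed by string
The map must be declared as map<string, int, less<>>; otherwise find only accepts a string key.
With map<string, int, less<>>, calling find passes the string_view straight through to the std::less<> template.
During the lookup, less<> converts each string it compares against into a string_view and then compares two string_views.
If the string_view is instead converted to a string before calling find, find no longer needs to build temporary objects internally.
In the comparison test below, the string_view version is slightly slower than the string version because of the string to string_view conversions. The results below were built with g++-7 -O2; the performance gap is within 1%.
./t3 (string_view version)
102400
Time taken by function: 1254672 microseconds
./t4 (string version)
102400
Time taken by function: 1243754 microseconds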

#include <map>
#include <string>
#include <string_view>
#include <iostream>
#include <chrono> 
using namespace std;
using namespace std::chrono; 
int main()
{
    string_view key = "hello";   //string key = "hello"
    map<string, int, less<>> coll;  //map<string, int> coll;
    for (int i=0;i < 102400;i++)
        coll.insert(make_pair(to_string(i),i));
    cout << coll.size()<< endl;
    long unsigned rep = 1024*1024 * 10;
    auto re = coll.find(key);
    auto start = high_resolution_clock::now(); 
    while (rep--)
    {
        re = coll.find(key); // ok
    }
    auto stop = high_resolution_clock::now(); 
    auto duration = duration_cast<microseconds>(stop - start); 
    cout << "Time taken by function: "
        << duration.count() << " microseconds" << endl; 
    return 0;
}

2020-03-02

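Building and testing ELFGO

Build process

Download the code

Download the code from https://github.com/pytorch/ELF
In the root directory of ELF, run git submodule sync && git submodule update --init --recursive to fetch the third-party code

Build with docker

To reduce the impact on the local system, the build can be done with docker.
First enter the root directory of ELF.
Before building, the Dockerfile needs a few small adjustments.
Add the following line after the original conda install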
RUN conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Then build with the following command (a proxy is used to speed up downloads)
sudo docker build --network host  --build-arg HTTP_PROXY=socks5://127.0.0.1:1080 --build-arg HTTPS_PROXY=socks5://127.0.0.1:1080 -t elf_go:vtrunk .

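Fixing runtime problems

Start the image that was built and copy its contents to the host so that the GPU can be used

Use sudo docker images to find the id of the image that was just built
Start the image with the example commands below, then copy what we need to the host

sudo docker run -dit --name elf_go ${your_image_id}
sudo docker cp elf_go:/go-elf ${your_host_elf_dir}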

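Prepare the paths on the host

Because the docker build ran as root, the resulting binaries depend on some directories under the root path. There are many ways to deal with this.
The proper fix is to change the build system to remove these dependencies on root-level directories and use a user-controlled directory as the root (or use relative paths). That is a lot of work, though.
There are also workarounds: patchelf can modify some search paths, and techniques such as fakeroot could redirect file access paths. These still take a fair amount of work.
Here the crudest method is used: mount bind creates mapped paths that satisfy the access requirements.
First create the required paths.

sudo mkdir -p /root/miniconda3
sudo mkdir /go-elf

Enter the directory we copied to, ${your_host_elf_dir}, and set up the mappings with the following commands

sudo mount -o bind ./miniconda3/ /root/miniconda3
sudo mount -o bind ./go-elf /go-elf

Finally set the search path

export PATH=$PATH:/root/miniconda3/bin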

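Download the v2 model

The mainline ELF version only works with the v2 model; the v0 model downloaded in the Dockerfile cannot be used. Download it from

https://dl.fbaipublicfiles.com/elfopengo/pretrained_models/pretrained-go-19x19-v2.bin and place it in ${your_host_elf_dir}/go-elf/ELF.

Adapt the python version

The python version used inside docker when ELF was built may differ from the one on the host. The launch command in /go-elf/ELF/scripts/elfgames/go/gtp.sh needs to be changed so that it uses the python version installed in conda (during the Dockerfile build).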

majiang@majiang-All-Series:~/hd/opensource/ELF_GO/builded_env/go-elf/ELF/scripts/elfgames/go$ git diff gtp.sh 
diff --git a/scripts/elfgames/go/gtp.sh b/scripts/elfgames/go/gtp.sh
index 0ca7aba..3291ade 100755
--- a/scripts/elfgames/go/gtp.sh
+++ b/scripts/elfgames/go/gtp.sh
@@ -9,7 +9,7 @@
 MODEL=$1
 shift
 
-game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 python3 df_console.py --mode online --keys_in_reply V rv \
+game=elfgames.go.game model=df_pred model_file=elfgames.go.df_model3 python3.7 df_console.py --mode online --keys_in_reply V rv \
     --use_mcts --mcts_verbose_time --mcts_use_prior --mcts_persistent_tree --load $MODEL \
     --server_addr localhost --port 1234 \
     --replace_prefix resnet.module,resnet init_conv.module,init_conv \

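If the python versions do not match, puzzling module-not-found errors may appear, such as the following.

ModuleNotFoundError: No module named 'torch._C'

The reason is that python extension libraries depend on the python version; see https://github.com/pytorch/pytorch/issues/574. If python3 on the host is not the python3.7 used by conda, the library cannot be found.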

Start the script shipped with ELF

cd /go-elf/ELF/scripts
source devmode_set_pythonpath.sh
cd /go-elf/ELF/scripts/elfgames/go/
./gtp.sh ../../../pretrained-go-19x19-v2.bin --loglevel off --gpu 0 --num_block 20 --dim 256 --mcts_puct 1.50 --batchsize 256 --mcts_rollout_per_batch 16 --mcts_threads 2 --mcts_rollout_per_thread 8192 --resign_thres 0.05 --mcts_virtual_loss 1

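Fixing code problems

cuda out of memory at v.pin_memory()

Added a print (v) statement at line 44 of go-elf/ELF/src_py/elf/utils_elf.py
print (v)
if gpu is not None:
    with torch.cuda.device(gpu):
        v = v.pin_memory()
v.fill_(1)
This seems to have been an intermittent failure; it did not reproduce afterwards

After startup, using genmove B to start a game triggered a data dimension mismatch at utils_elf.py line 210

genmove B raised an exception. Investigation suggested a data dimension mismatch at utils_elf.py line 210, shown below; changing the call fixed it.

#bk[:] = v.squeeze_()
bk[:] = v.squeeze()

Error symptoms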

> /go-elf/ELF/src_py/elf/utils_elf.py(192)copy_from()
-> for k, v in this_src.items():
(Pdb) y
*** NameError: name 'y' is not defined
(Pdb) q
Traceback (most recent call last):
  File "df_console.py", line 86, in <module>
    main()
  File "df_console.py", line 79, in main
    GC.run()
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/go-elf/ELF/src_py/elf/utils_elf.py", line 436, in run
    self._call(smem, *args, **kwargs)
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/go-elf/ELF/src_py/elf/utils_elf.py", line 404, in _call
    keys_extra, keys_missing = sel_reply.copy_from(reply)
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/go-elf/ELF/src_py/elf/utils_elf.py", line 192, in copy_from
    for k, v in this_src.items():
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/go-elf/ELF/src_py/elf/utils_elf.py", line 192, in copy_from
    for k, v in this_src.items():
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/miniconda3/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/majiang/hd/opensource/ELF_GO/builded_env/miniconda3/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit

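GTP protocol problem

Once the gtp script is running, games can be played in the text interface with GTP commands.
However, the sabaki Go front end could not communicate properly with ELFGO over GTP.
Checking the GTP protocol showed that extra output printed by ELFGO was interfering with protocol parsing.
Following https://github.com/pytorch/ELF/compare/master...Narsil:master, the following was changed in console_lib.py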

    def print_msg(self, ret, msg):
-        print("\n%s %s\n\n" % (("=" if ret else "?"), msg))
+        print("%s %s\n\n" % (("=" if ret else "?"), msg))

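When starting the gtp script, the --loglevel off option can be added to turn off a large amount of extra output.

Wrapper script

So that front ends such as sabaki can start the ELFGO back end conveniently, the following wrapper script does the whole startup in one go.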

#!/bin/bash
#set -x
cur_shell_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd ${cur_shell_dir}

cd ../../../../../
sudo mount -o bind ./miniconda3/ /root/miniconda3
sudo mount -o bind ./go-elf  /go-elf
cd -

export PATH=$PATH:/root/miniconda3/bin
cd ../../
source devmode_set_pythonpath.sh
cd -
./gtp.sh ../../../pretrained-go-19x19-v2.bin --loglevel off --gpu 0 --num_block 20 --dim 256 --mcts_puct 1.50 --batchsize 256 --mcts_rollout_per_batch 16 --mcts_threads 2 --mcts_rollout_per_thread 8192 --resign_thres 0.05 --mcts_virtual_loss 1

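Open issues

In sabaki, elfgo can currently only play black. Playing white still hangs; the protocol text output may still have problems.
Also, the documentation mentions that elfgo can only play under rules with a komi of 7.5.

Another Go AI that can report win-rate estimates: leela-zero

Download and unpack Lizzie.0.7.2.Mac-Linux.zip, then enter the unpacked directory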

export http_proxy="socks5://127.0.0.1:1080"
export https_proxy="socks5://127.0.0.1:1080"
git clone https://github.com/gcp/leela-zero
cd leela-zero
git submodule update --init --recursive
sudo apt install libboost-dev libboost-program-options-dev libopenblas-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
sudo apt install cmake g++ libboost-dev libboost-program-options-dev libboost-filesystem-dev opencl-headers ocl-icd-libopencl1 ocl-icd-opencl-dev zlib1g-dev
mkdir build
cd build
cmake ../
...
CMake Warning at CMakeLists.txt:129 (message):
  Qt is not found, build for `autogtp` and `validation` is disabled
...
cmake --build .
cd ../../
cp ./leela-zero/build/leelaz ./
chmod +x  ./lizzie.jar
./lizzie.jar

2020-03-02

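Upgrading hexo and inserting images

Image insertion fails

Followed the approach in articles such as https://blog.csdn.net/u010996565/article/details/89196612.

npm install hexo-asset-image --save

Then inserted the following into the text

<img src="/images/IMAGE_NAME" width="50" height="50">
![](imageIMAGE_NAME)

Neither displayed the image correctly.
The terminal where hexo server was started shows the following message

update link as:-->/.com//xxxx.png

The address the image opens at is also /.com//xxxx.png.
A search turned up the following page describing a similar problem.
https://blog.csdn.net/xjm850552586/article/details/84101345
But the solution it gives did not work either.
Since installing hexo-asset-image had mentioned requiring hexo 4.0 or later while the current hexo version was 3.9, I decided to upgrade first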

Upgrading hexo

Followed the link below
https://www.dazhuanlan.com/2020/01/29/5e3165c116e0f/?__cf_chl_jschl_tk__=e7caf23cde91dcc1f9fc6e1006b88d0bf63de758-1584498843-0-AQ7fucDfahofVnETCllMEC0jJ8Wrecp_5-NxLutNm_NE07_6SkOP9aYE2mlgJv6oQ9Z78RVSuvf1uZBXN-AwXQ898WXTdcE5uf90w-_XCgWONaii4YIUO62_6vLfMSqvz4miZBzOpMYwmaop6pjkWnQWyhJIsJ-LkHyQXZPBRAteDLBEWZ1-gkaq3aXBT4DZl6-I2wJ_R1WFiBqHDlBlrr2treoeNxPspJpjB0L5Qzd1-8JtGPUprKEBt2KCKCjAXtWL9NZ6TJb93L6wrwpWsJdY45QkNxfwTYq6NPOPd4H3suwbWr1P_FY2lWzbk7NfwQ
The following steps upgraded hexo to 4.2.0
1 Clear the node cache

npm cache clean -f

2 Install the node version manager n

npm install -g n

3 Install the latest stable version of node

n stable

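// to install the latest version, use n latest
// to install a specific version, use n {version}, for example n 11.2.0
// to remove a specific version of node, use n rm {version}, for example n rm 11.2.0
4 Update npm

npm install npm@latest -g

5 Enter the blog directory and run

npm outdated

6 Based on the output of step 5, note the latest version of each component by hand, then update the version numbers in package.json one by one.

7 Update the modules

npm install --save

8 Confirm the result of the upgrade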

hexo -v

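Fixing problems after the upgrade

After the upgrade, images still did not display properly.
The symptom changed: with a correct path, the page showed the path as text instead of the image.
With the syntax mentioned in some blogs, the generated page was simply blank (the html file shows the whole section was dropped).

{% asset_img slug [title] %}

Some blogs mention that hexo 4.0 merged many plugins into the core, so, suspecting a conflict with the plugins, I uninstalled the third-party plugins with the commands below.

npm remove hexo-asset-link --save
npm remove hexo-asset-image --save

After that, the plain markdown syntax in the file displays images correctly.
Images on the home page display properly too. Presumably hexo 4.0 improved things when it merged image link into the core (package.json shows a new hexo-image-link component), which fixed the markdown syntax.

![Alt Text](IMAGE_FILE "Title Text")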

2020-02-21

Kaleidoscope user defined operator

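Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl06.html.

Exercise

Understand the original page, then implement the same functionality.

Extensions

1 Naming places no restriction on whether an operator may consist of ordinary characters (such as abc or 123). Can this produce silent conflicts with keywords?
2 Consider how to handle operators made of two special characters, such as == and !=.
3 Can the custom operators in the example be used as a library? If not, how should this be improved?

Progress

Completed the user-defined binary and unary operators of the original example, but user-defined "unary-" is not supported yet.
unary- is not supported yet because, to keep users from making mistakes, operators that use protected characters (built-in operators, separators and so on) cannot be defined. The check is quite simple: it only compares chars, which is too strict and ends up blocking a legitimate request like unary-. Once the protection rules are refined (splitting operators and protected characters into two classes, and distinguishing binary from unary operators), unary- can be supported.

Implementation

During implementation it became clear how concise the original example is: it implements user-defined operators with very little code. The price of that conciseness is that the feature has demonstration value only and almost no practical value.
For example, the original example has no token layer in the lexer, and the parser reads input chars directly. This keeps the lexer extremely simple. And when new syntax is added, the parser can reach through to the chars, so little supporting code is needed.
But this design makes the code hard to maintain. The parser has to handle token splitting itself, and once the grammar becomes complex, adding features or changing existing logic becomes difficult (there will be many places that manipulate chars directly).
In addition, the original example has no error protection, which makes user mistakes hard to track down.

Each of the questions raised under Extensions above was addressed during implementation. See the Extension questions section for details.
Judging by the result, the improvements brought in a considerable amount of extra code and work. For a tutorial example, the original author's lean approach really is the better trade-off compared with introducing many detailed flows (the example loses the potential to grow into a truly usable compiler, but it still shows the main principles and solves the core problem).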

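Extension questions

1 Naming places no restriction on whether an operator may consist of ordinary characters (such as abc or 123). Can this produce silent conflicts with keywords?
The original example has almost no error protection for custom operators. When a user defines an operator incorrectly, the mistake is hard to notice and hard to trace to its root cause.
As shown below, overriding the precedence and semantics of a built-in operator leads to ambiguity and confusion.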

ready> def binary + 80 (a b) a+b; 
ready> Read function definition:define double @"binary+"(double %a, double %b) {
entry:
  %addtmp = fadd double %a, %b
  ret double %addtmp
}

ready> 1+2*3;
ready> Evaluated to 9.000000

Defining the reserved character ',': the parser cannot understand it.

ready> def unary , (a) a+1;
ready> Read function definition:define double @"unary,"(double %a) {
entry:
  %addtmp = fadd double %a, 1.000000e+00
  ret double %addtmp
}
ready> def xy3(a) a+,a;
ready> Error: unknown token when expecting an expression
ready> Error: Unknown variable name

Defining the reserved character '(': the parser cannot understand it.

ready> def unary ( (a) a+1;
ready> Read function definition:define double @"unary("(double %a) {
entry:
  %addtmp = fadd double %a, 1.000000e+00
  ret double %addtmp
}
ready> def xy4(a) a+(1  
ready> ;
Error: expected ')'

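To avoid these errors, the implementation validates the names of binary/unary operators (the verify_operator_sym function).
The error is stopped where it first occurs, with a clear message.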
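
The body of verify_operator_sym is not shown in this post. The sketch below is a hypothetical check in the same spirit, written along the lines of the refined rule described under Progress (protected characters and built-in operators treated separately, unary and binary distinguished). The function name and both character sets are assumptions made for illustration.

```cpp
#include <cctype>
#include <string>
#include <string_view>

enum class OpKind { Unary, Binary };

// Returns an empty string when `sym` may be defined as an operator of `kind`,
// otherwise a message describing why it is rejected.
std::string check_operator_symbol(std::string_view sym, OpKind kind) {
    // Assumed set of characters the parser itself relies on.
    constexpr std::string_view protected_chars = "(),;:#";
    // Assumed set of built-in single-character binary operators.
    constexpr std::string_view builtin_binary = "+-*/<=";

    if (sym.empty())
        return "operator symbol is empty";
    for (char c : sym) {
        auto uc = static_cast<unsigned char>(c);
        if (std::isalnum(uc) || c == '_' || std::isspace(uc))
            return "operator symbol must not contain identifier characters or spaces";
        if (protected_chars.find(c) != std::string_view::npos)
            return "operator symbol uses a protected character";
    }
    // Only the binary form of a built-in operator is off limits, so a
    // user-defined unary '-' remains possible.
    if (kind == OpKind::Binary && sym.size() == 1 &&
        builtin_binary.find(sym[0]) != std::string_view::npos)
        return "built-in binary operator cannot be redefined";
    return {};
}
```

Under these assumed sets, the three failing sessions above (binary +, unary , and unary () are all rejected at definition time, while unary - and binary != are accepted.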

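2 Consider how to handle operators made of two special characters, such as == and !=.
Because the original example uses a char as the operator's opcode, it cannot support two-character operators. On the other hand, a single char opcode greatly simplifies building operator tokens: the parser just reads one char when needed, and the lexer never has to worry about splitting correctly.
To support two-character operators that are clearly useful, such as != and ==, this implementation adds a map lookup mechanism to the lexer, as shown below.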

if (cur_token == TOKEN_BINARY || cur_token == TOKEN_UNARY)
{
	if (install_user_defined_operator(input_stream, cur_char))
		return;
}

if (get_user_defined_operator(input_stream, cur_char))
	return;

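install_user_defined_operator recognizes the user-defined operator token that follows the binary and unary keywords and registers it (that is, inserts it into the map).
get_user_defined_operator then looks up the registered user-defined operators and returns the correct token type (user-defined unary or binary).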
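
The two functions themselves are not shown, so the following is only a sketch of how such a registry could work. OperatorTable, install and match are hypothetical names, and the two-character limit and the longest-match policy are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

enum class OpToken { UserUnary, UserBinary };

class OperatorTable {
public:
    // Register a user-defined operator symbol (the "install" step).
    void install(std::string sym, OpToken kind) { ops_[std::move(sym)] = kind; }

    // Longest-match lookup at the start of `input` (the "get" step): the
    // two-character candidate is tried before the one-character candidate,
    // so "!=" wins over "!".
    std::optional<std::pair<std::string, OpToken>> match(std::string_view input) const {
        for (std::size_t len = std::min(max_len, input.size()); len > 0; --len) {
            auto it = ops_.find(input.substr(0, len));
            if (it != ops_.end())
                return std::make_pair(it->first, it->second);
        }
        return std::nullopt;
    }

private:
    static constexpr std::size_t max_len = 2;  // assumed maximum operator length
    // std::less<> allows lookup by string_view without building a std::string.
    std::map<std::string, OpToken, std::less<>> ops_;
};
```

A lexer built on this would consume the length of the matched symbol from the input stream and return the corresponding token kind.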

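3 Can the custom operators in the example be used as a library? If not, how should this be improved?
In principle yes, because operators are parsed in the prototype just like functions.
As long as a custom operator follows the rule of being declared with extern before it is used, it works as a library (defined in a separate library file and linked in as a binary).
However, the original example does not consider fault tolerance.
The parser works with the precedence given in the extern declaration. If the precedence at the definition differs from the one in the extern, the program's logic changes silently. Such a problem can be very hard for users to discover (similar to not including the right header in C).
To solve this, following the approach of C++, prio is added as mandatory information to the name of the binary/unary function. If the extern declaration and the def definition disagree, the link fails because the function definition cannot be found, so the user never gets a binary that works but is logically wrong.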
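
A sketch of such a naming scheme, modeled on the symbol _binary_,_with_prio_1 that is visible in the IR dump of the debug info entry. The function name is hypothetical, and the exact format used by the implementation is inferred from that one symbol only.

```cpp
#include <string>

// Builds the link-time name of a user-defined binary operator. Because the
// priority is part of the name, an extern declaration with a different
// priority refers to a symbol that does not exist, and the link fails.
std::string mangle_binary_operator(const std::string& op, int prio) {
    return "_binary_" + op + "_with_prio_" + std::to_string(prio);
}
```

For example, declaring ',' with priority 2 while the library defines it with priority 1 would look for _binary_,_with_prio_2 and find no definition.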

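Problems encountered during implementation

1 After the key of prototype_tab was switched from string to string_view, memory corruption appeared. Details follow.
With string_view as the map key, we called tab.insert(make_pair(a_string, my_val));. The assumption was that the string would be converted to a string_view automatically and that the key (string_view) in tab would point at a_string's data. In fact the key's data pointed into the stack frame of the function containing the insert.
The small test case below reproduces the problem: the key in the pair does not point at the global data we intended.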

#include <iostream>
#include <string>
#include <string_view>
using namespace std;
void test_view(pair<string_view, int> p)
{
        cout << "value:"<< p.first.data() << endl;
        cout << "address:" << (void*)p.first.data() << endl;
}

string global("abc");
int main()
{
        cout << "value_global:" << global.data() << endl;
        cout << "address_global:" << (void*)global.data() << endl;
        test_view(make_pair(global,1));
        return 0;
}

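Compiling the code above with g++ -save-temps -fdump-tree-all-details -std=c++17 shows the problem fairly clearly.
The core mistake is this:
make_pair is a function template, and (in order to be generic) it does not know, or enforce, that the first argument should be a string_view. When we pass a string as the first argument, the call test_view(make_pair(global,1)); does not become the test_view(make_pair((string_view)global,1)) we had hoped for. Instead it becomes the following sequence.
tmp_pair = make_pair<string, int>(global, 1);
tmp_pair_to_call = pair<string_view, int> (tmp_pair);
test_view(tmp_pair_to_call);
So the string_view in the pair passed to test_view actually points at a temporary created on the stack. As soon as tmp_pair is destroyed (at the end of the full expression, so by the time the insert has returned), the key in the string_view is garbage (use after free).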
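
One way to avoid the intermediate pair<string, int> is to hand the map a string_view explicitly, for example through emplace. The sketch below uses hypothetical names (global_name, key_aliases, insert_without_temporary_string) and checks that the stored key really points into the owning string.

```cpp
#include <map>
#include <string>
#include <string_view>

// Owning storage that outlives the table.
std::string global_name("abc");

// Returns true when the key stored in `tab` points into `owner`'s buffer.
bool key_aliases(const std::map<std::string_view, int>& tab, const std::string& owner) {
    auto it = tab.find(std::string_view(owner));
    return it != tab.end() && it->first.data() == owner.data();
}

bool insert_without_temporary_string() {
    std::map<std::string_view, int> tab;
    // The key is constructed in place from a string_view over global_name,
    // so no temporary std::string copy is ever involved.
    tab.emplace(std::string_view(global_name), 1);
    return key_aliases(tab, global_name);
}
```

The key is still only valid for as long as global_name is alive and unmodified; the map never owns the characters.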

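This experience shows that string_view really is very error-prone and must be used with great care. As the following example from https://alexgaynor.net/2019/apr/21/modern-c++-wont-save-us/ shows, a string_view is allowed to point at a temporary object, which easily leads to bugs that are hard to find.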

int main() {
  std::string s = "Hellooooooooooooooo ";
  std::string_view sv = s + "World\n";
  std::cout << sv;
}

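2 After switching to the clang8 compiler, the program segfaulted while optimizing LLVM IR.
A Google search found the email below; this is a gcc/clang ABI compatibility problem.
http://lists.llvm.org/pipermail/llvm-dev/2019-January/129567.html
The llvm libraries need to be rebuilt with clang (or llvm upgraded to the latest version).
Upgrading to the latest llvm code did fix the problem.

3 After upgrading to the latest llvm, asan reported a large number of indirect leaks.
A careful check of the code showed that functions such as llvm_optimizer::optimize_function
use a TargetMachine * created by createTargetMachine.
After adding a delete of the TargetMachine *, the leak reports disappeared.
Why the old version reported no leak has not been investigated further.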

2020-02-18

Kaleidoscope control flow

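Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl05.html.

Exercise

Understand the original page, then implement the same functionality.

Extensions

The for in the original example introduces a loop variable but still no concept of scope; all variable names simply go into the function's NamedValues map (cur_func_args in our implementation). Can this cause name conflicts, and is a hierarchical name lookup table strictly necessary?

Progress

Added if and for as in the original example.
During implementation the for logic of the original example turned out to be flawed, so it was rewritten along the lines of if.

Implementation

Most languages use ',' as the sequential evaluation operator, and Kaleidoscope will also gain multi-statement support later. The separator in the for expression was therefore changed from "," to ":" to keep "," in reserve.

The main change in the implementation is a rewrite of the semantics of for, as follows:
When the for loop of the original example is expanded, the end check is performed only after the loop body has run, which conflicts with the definition of for in most languages.
With the original example's code, the following loop still prints the number 1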

def forWrong()
    for x = 1: x < 1: 1 in
        print(x)

This is clearly not the for semantics most people expect.

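The IR generation for for was rewritten. The revised logic is structured as follows:

preheader_bb:
	compute the initial value of the induction var
	goto end_check
end_check:
	compute end_val, the value of the end expr
	if (end_val)
		goto loop
	else
		goto after_loop

loop:
	compute the value of the loop body expr
	induction var += step
	goto end_check

after_loop:
	xxxx subsequent instructions

With this change, for behaves the same way as in C and similar languages.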
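
The difference between the two orderings can be modeled in plain C++. This is a simplified sketch with hypothetical function names: it only captures where the end check sits relative to the body, not the details of the generated IR, and the end condition is fixed to x < limit with a positive step.

```cpp
#include <vector>

// Check-first ordering, following the block layout above. Returns the
// values of the induction variable for which the body ran.
std::vector<int> for_check_first(int init, int limit, int step) {
    std::vector<int> visited;
    int x = init;              // preheader_bb
    while (x < limit) {        // end_check
        visited.push_back(x);  // loop: body
        x += step;             // loop: induction var += step
    }
    return visited;            // after_loop
}

// Body-first ordering, as in the original example: the body always runs
// once before the end condition is looked at.
std::vector<int> for_body_first(int init, int limit, int step) {
    std::vector<int> visited;
    int x = init;
    do {
        visited.push_back(x);
        x += step;
    } while (x < limit);
    return visited;
}
```

For the forWrong example (start 1, end condition x < 1) the body-first model visits 1 once, while the check-first model visits nothing.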

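Extension questions

When expanding a for expression, the original example first saves the variable name that the for induction variable might shadow and restores the original value on leaving the for expansion (if there is no name clash, the induction variable defined by for is removed from the name lookup map named_var). So, as long as inner bindings shadow outer ones and no mechanism is provided for reaching the shadowed outer value, there is indeed no need for a dedicated scope concept.
For example: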

def xt(i)
    for i = 0 : i < 5 : 1 in
        print(i)

Even if other statements are added after the for expression, they can still access the value of the parameter i normally.

2020-02-14

Kaleidoscope optimizer

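Study material

Finished working through https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.html.

Exercise

Understand the original page, then implement the same functionality (but replace the JIT approach of the original example with a traditional compilation flow).

Extensions

1 After dumping LLVM IR, can the prebuilt llvm components be used to compile and run it? How?
2 Write a driver wrapper (a shell script will do) that compiles the program into an executable with the main function as the entry point

Progress

Completed the optimization part of the original example and removed its JIT part.

Implementation

The main changes:
1 Removed the JIT feature. The extension library functions in the original example serve no real purpose and were removed too (with extern syntax available, extension is always possible and has nothing to do with JIT)
2 Added module-level optimization, so a single source file can be optimized as a whole (inlining, for example)
3 Switched the PassManager implementation from the legacy one used in the example to the new one
4 Implemented compilation of input files

Extension questions

1 Once the toy front end has generated LLVM IR, the standalone opt tool provided by LLVM can optimize it. The optimized output can be textual LLVM IR (the -S option) or bitcode. clang can then compile it into a binary, after which normal linking applies. In other words, if the cost of one extra IR export and import is acceptable, optimization does not have to be implemented in toy_compiler at all.

2 Wrote a compiler.sh that compiles the input file into an executable (a main function must be provided). Added a kout function dedicated to printing function return values.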