Skip to content

2....a)Hardware:②FBTA

Tingyuan LIANG edited this page Aug 28, 2018 · 29 revisions

Buddy Tree without AND-Gate Tree:

Previous buddy tree allocators can get trapped in the search of the first zero bit in the OR-gate bit-vector. In [1], the AND-gate and the OR-gate trees are described as combinational logic resulting in long critical path. The work in [2] inserted registers to break the critical path but led to hundreds of steps to perform the search. Without AND-gate tree in [2], FBTA extracts the first zero bit via direct bitwise operations on the OR-gate BV [3].

Moreover, FBTA involves parallelism in the maintenance of the buddy tree, which also reduces the latency of maintenance. Additionally, FBTA can hide the latency of maintenance from the application accelerator.


Allocation Stage Based on bitwise and arithmetic operators:

By inverting the bit-vector into , the first-zero search problem is turned into finding the lowest set bit of . By subtracting 1 from , we can clear the lowest set bit of but all the other one bits in remain set. Thus, consists of only the lowest set bit of . These operations can be grouped into the following equation:

in the equation above will indicate the lowest zero with 1 while the other bits will be 0.


For example,


To make this procedure clear, here is a complete example. Suppose we need to find the lowest set bit of 'b10010101100.

A = 'b10010101100

B = A - 'b1 = 'b10010101011

C = A & B = 'b10010101000

D = A - C = 'b00000000100

compare A, B and C

A = 'b10010101100

B = 'b10010101011

C = 'b10010101000

D = 'b00000000100


can be translated into the actual location number of the first zero bit in the bit-vector , with a MUX-based base 2 logarithm calculator. In this way, the latency of searching the first zero bit decreases significantly from the time complexity of . For instance, to find the first zero bit in a 64-bit vector, the latency of allocation stage based on FBTA is only 4 cycles at 100MHz. By grouping these 64-bit vectors and searching concurrently among them, FBTA can manage hundreds of MAUs much quicker. Note that the location of layer and the communication between accelerator and allocator cost 4 cycles so that the overall latency of allocation stage will be 8 cycles.


The figure below visualize the procedure of allocation based on FBTA:


Parallelism between Allocator and Accelerator

Moreover, in previous works, the accelerator needed to wait for the allocator to finish not only allocation but also maintenance, as shown in the figure below.

However, since the maintenance stage of allocator does not rely on any information from the accelerator, overlapping the maintenance of allocator with the subsequent actions of the accelerator can hide the latency of maintenance stage from the accelerator, as shown in the figure below.


Maintenance Stage Based on Parallelism:

In the implementation of [2], marking upwards is done after marking downwards. However, these two procedures actually process different objects and there is no dependence between them. Thus, in FBTA, they are executed concurrently. The work in [2] inserted registers between stages manually which may not be the optimal solution and can be replaced by HLS optimized solution. In Hi-DMM, the process of maintenance is pipelined by using HLS directive thus achieving higher hardware parallelism, which could be hard to implemented at RTL level by manual analysis. With these improvements, as an instance, to maintain a buddy tree with depth of 7, it only costs FBTA 7 cycles on average at 100MHz. Below is an example of maintenance, where parallelism can be noticed.


References:

[1] Chang, J. Morris, and Edward F. Gehringer. "A high-performance memory allocator for object-oriented systems." IEEE Transactions on Computers 3 (1996): 357-366.

[2] Xue, Zeping, and David B. Thomas. "SysAlloc: A hardware manager for dynamic memory allocation in heterogeneous systems." Field Programmable Logic and Applications (FPL), 2015 25th International Conference on. IEEE, 2015.

[3] GNU. (2018) C library. [Online]. Available: http://ftp.gnu.org/gnu/glibc/glibc-2.27.tar.xz