In TensorFlow's world, a low-level implementor should split all types into two groups: non-TF_STRING and TF_STRING.
This is because TensorFlow does not treat TF_STRING the way traditional languages treat strings: each TF_STRING element has a variable length. For example, the string "yorkie is so cute" is a single TF_STRING element, and so is the single character "x".
What's a tensor
Let's start with the keyword "tensor". A tensor is a data structure, also known as a multidimensional array. An ordinary array, the vector of mathematics, is a special case of the tensor structure.
For example, a tensor could be represented as:
100
[1,2,3]
[[2,3],[5,7]]
[[[1],[2],[3],[4]]]
In the above example, every line is a tensor. The number at the top is called a scalar tensor, followed by a vector, a matrix, and an n-tensor. Here n is the number of dimensions of the tensor or array:
scalar: n = 0
vector: n = 1
matrix: n = 2
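As a minimal sketch of how n relates to the written form, the nesting depth of a tensor expressed as nested lists gives its dimension. The function name `rank` is hypothetical, and the sketch assumes non-empty, rectangular nesting:

```python
def rank(value):
    """Return n, the nesting depth of a tensor written as nested lists."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]  # assumption: every level is non-empty and rectangular
    return depth


print(rank(100))                     # scalar
print(rank([1, 2, 3]))               # vector
print(rank([[2, 3], [5, 7]]))        # matrix
print(rank([[[1], [2], [3], [4]]]))  # n-tensor
```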
More formally, I'm going to introduce the concept of a shape array and show how it relates to the parameter n. Every tensor's structure is described by a shape vector, and n is the length of this vector.
From the first position to the last, each element of the shape gives the size of the corresponding dimension. The shape [5] describes a vector with 5 elements; [3, 2] describes a matrix of 3 sub-vectors, each holding 2 scalar elements, so the total number of elements is 3 * 2 = 6.
As a more complex example, the shape [100, 99, 5, 5] represents a tensor with 100 elements, each of which is itself a tensor of shape [99, 5, 5]; the total number of scalar elements is 100 * 99 * 5 * 5 = 247500.
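The arithmetic above can be sketched in a few lines. The helper names `num_elements` and `element_shape` are hypothetical and only serve the illustration:

```python
import math


def num_elements(shape):
    """Total number of scalar elements described by a shape."""
    # The product of an empty shape is 1: a scalar holds exactly one element.
    return math.prod(shape)


def element_shape(shape):
    """Shape of each of the shape[0] sub-tensors."""
    return shape[1:]


print(num_elements([5]))               # 5
print(num_elements([3, 2]))            # 6
print(num_elements([100, 99, 5, 5]))   # 247500
print(element_shape([100, 99, 5, 5]))  # [99, 5, 5]
```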
Store a tensor
In the last section, we covered what a tensor is and how to represent it in a human-readable way. Next, we will look at how to store a tensor in a machine.
A stored tensor is described by 3 fields: type, shape, and buffer. All the real data is put into the buffer field, which is a flat, fixed-size array in storage, and we can think of the type and shape fields as the metadata of the buffer:
type describes the element size
shape describes how to encode and decode the buffer
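To make the roles of the three fields concrete, here is a minimal sketch of encoding and decoding a fixed-size tensor. Everything in it is an assumption for illustration: the `FORMATS` table, the dictionary layout, the little-endian byte order, and the function names are hypothetical and are not TensorFlow's actual structures.

```python
import struct

# Hypothetical type table: type name -> struct format character.
# The element size follows from the format ("i" is 4 bytes, "q" is 8 bytes).
FORMATS = {"float32": "f", "int32": "i", "int64": "q"}


def flatten(value):
    """Yield the scalars of a nested list in row-major order."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def encode(dtype, shape, value):
    """Pack all scalars into one flat buffer; type and shape are metadata."""
    flat = list(flatten(value))
    buffer = struct.pack("<%d%s" % (len(flat), FORMATS[dtype]), *flat)
    return {"type": dtype, "shape": shape, "buffer": buffer}


def decode(tensor):
    """Use type to split the buffer into scalars and shape to re-nest them."""
    fmt = FORMATS[tensor["type"]]
    size = struct.calcsize("<" + fmt)
    count = len(tensor["buffer"]) // size
    flat = list(struct.unpack("<%d%s" % (count, fmt), tensor["buffer"]))
    # Rebuild the nesting from the innermost dimension outwards.
    for dim in reversed(tensor["shape"][1:]):
        flat = [flat[i:i + dim] for i in range(0, len(flat), dim)]
    return flat if tensor["shape"] else flat[0]


matrix = encode("int32", [2, 2], [[2, 3], [5, 7]])
print(len(matrix["buffer"]))  # 4 elements * 4 bytes each
print(decode(matrix))
```

The decoder only works because every element has the same size, which is exactly the property strings lack.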
Oh, strings have no fixed size
As noted at the beginning of this note, TensorFlow treats strings as a variable-length type, which is a problem for the encoding method described so far.
To represent a tensor of strings correctly, the implementor should introduce another array, the offsets indices, to tell the encoder/decoder where each element is located in the buffer. The offsets indices form a uint64 array whose length is the number of elements. For example, if we have a string tensor:
"foobar", "yorkie is so cute"
The encoder writes the data as before; the only difference is that it also records the start position of every string in the "offsets indices" vector. Correspondingly, the decoder reads the "offsets indices" vector to locate each string in the buffer.
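The idea can be sketched as follows. This is a simplified illustration and not TensorFlow's exact layout: it assumes the offsets table is stored in front of the string data, that offsets are little-endian uint64 values relative to the start of the data section, and that each string ends where the next one starts. The function names are hypothetical.

```python
import struct


def encode_strings(strings):
    """Return a buffer made of a uint64 offsets table followed by the string data."""
    data = b""
    offsets = []
    for s in strings:
        offsets.append(len(data))  # start position within the data section
        data += s.encode("utf-8")
    table = struct.pack("<%dQ" % len(offsets), *offsets)
    return table + data


def decode_strings(buffer, count):
    """Read `count` offsets, then slice the data section at those positions."""
    table_size = 8 * count  # one uint64 per element
    offsets = struct.unpack("<%dQ" % count, buffer[:table_size])
    data = buffer[table_size:]
    # Assumption: each element ends where the next begins; the last ends with the data.
    ends = list(offsets[1:]) + [len(data)]
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, ends)]


buf = encode_strings(["foobar", "yorkie is so cute"])
print(struct.unpack("<2Q", buf[:16]))  # the offsets indices
print(decode_strings(buf, 2))
```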
Here comes a C API story. In TensorFlow's C API, there are 4 relevant functions:
TF_StringEncode
TF_StringDecode
TF_Tensor_EncodeStrings
TF_Tensor_DecodeStrings
TF_Tensor_EncodeStrings and TF_Tensor_DecodeStrings are not exposed. In my first attempt to implement string tensor encoding/decoding at yorkie/tensorflow-nodejs@ce922f7, I mistakenly assumed that TF_StringEncode and TF_StringDecode contained the full implementation, and I got the following error:
Malformed TF_STRING tensor; element 0 out of range
After reading the functions TF_Tensor_DecodeStrings and TF_Tensor_EncodeStrings, I found that the "offsets indices" logic is defined there, and I learned that TF_StringEncode/TF_StringDecode serve another purpose. I then re-implemented the "offsets indices" logic in JavaScript in my own implementation.
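My reading is that the per-element functions only deal with a single string, prefixing it with its length as a varint, while the offsets table is built separately around them. Treat that as an assumption rather than a specification. The sketch below illustrates such a length-prefixed element codec; the names `string_encode` and `string_decode` and the error messages are hypothetical and do not reproduce the C API.

```python
def string_encode(raw):
    """Length-prefix one element: varint(len(raw)) followed by the bytes."""
    n = len(raw)
    prefix = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            prefix.append(byte | 0x80)  # high bit set: more length bytes follow
        else:
            prefix.append(byte)
            break
    return bytes(prefix) + raw


def string_decode(buffer, start):
    """Read one element at `start`; return (bytes, position after the element)."""
    if start >= len(buffer):
        raise ValueError("element out of range")
    length, shift, pos = 0, 0, start
    while True:
        if pos >= len(buffer):
            raise ValueError("truncated length prefix")
        byte = buffer[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    end = pos + length
    if end > len(buffer):
        raise ValueError("element runs past the end of the buffer")
    return buffer[pos:end], end


encoded = string_encode(b"foobar")
print(encoded)
print(string_decode(encoded, 0))
```

A codec like this says nothing about where each element starts inside a whole tensor buffer, which is why a decoder that expects an offsets table rejects a buffer built from element encodings alone.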
Summary
In this note, I shared how TensorFlow stores data internally, along with a little story about its C API. If you are going to implement a TensorFlow client, this may help you build the foundation of your library.
I have implemented the encoding and decoding in tensorflow-nodejs; if you are interested in more details, take a look at the following links: