Commit c00e3a8

Merge pull request #13 from brainsqueeze/dev

Version 2.0 improvements and deprecated removals

brainsqueeze authored Jul 7, 2022
2 parents 13ee809 + 3b974a5 commit c00e3a8
Showing 29 changed files with 746 additions and 1,676 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -7,6 +7,7 @@
**/*.dev.yml

wiki_t2v/
multi_news_t2v*/

# JavaScript configs and dependencies
**/.eslintrc.json
171 changes: 47 additions & 124 deletions README.md
@@ -5,20 +5,16 @@
Models for contextual embedding of arbitrary texts.
## Setup
---

To get started, one should have a flavor of TensorFlow installed, with version `>=2.4.1`. One can run
```bash
pip install tensorflow>=2.4.1
```
If one wishes to run the examples, some additional dependencies from HuggingFace will need to be installed. The full installation looks like
```bash
pip install tensorflow>=2.4.1 tokenizers datasets
```

To install the core components as an importable Python library, simply run

```bash
pip install git+https://github.com/brainsqueeze/text2vec.git
```

@@ -27,119 +23,46 @@
## Motivation
---

Word embedding models have been very beneficial to natural language processing. The technique is able to distill semantic meaning from words by treating them as vectors in a high-dimensional vector space.

This package attempts to accomplish the same semantic embedding, but at the sentence and paragraph level. Within a sentence, the order of words and the use of punctuation and conjugations are very important for extracting the meaning of blocks of text.

Inspiration is taken from recent advances in text summary models (pointer-generator), where an attention mechanism [[1](https://arxiv.org/abs/1409.0473)] is used to extrude the overall meaning of the input text. In the case of text2vec, we use the attention vectors found from the input text as the embedded vector representing the input. Furthermore, recent attention-only approaches to sequence-to-sequence modeling are adapted.

**note**: this is not a canonical implementation of the attention mechanism, but this method was chosen intentionally to be able to leverage the attention vector as the embedding output.


### Transformer model
---

This is a tensor-to-tensor model adapted from the work in [Attention Is All You Need](https://arxiv.org/abs/1706.03762). The embedding and encoding steps follow directly from [[2](https://arxiv.org/abs/1706.03762)]; however, self-attention is applied at the end of the encoding steps and a context-vector is learned, onto which the decoding tensors are then projected.

The decoding steps begin as usual with the word-embedded input sequences shifted right, then multi-head attention, a skip connection and layer-normalization are applied. Before continuing, we project the resulting decoded sequences onto the context-vector from the encoding steps. The projected tensors are then passed through the position-wise feed-forward (conv1D) layer, skip connection and layer-normalization again, before once more being projected onto the context-vectors.
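As a concrete illustration of that projection step, the sketch below shows one way to project decoded sequence tensors onto a per-example context vector in plain TensorFlow. It mirrors the idea behind the library's `TensorProjection` utility but is not taken from its implementation; shapes and names here are assumptions.
```python
import tensorflow as tf

def project_onto_context(decoded: tf.Tensor, context: tf.Tensor) -> tf.Tensor:
    """Project each time step of `decoded` (batch, time, dim) onto the
    per-example `context` vector (batch, dim)."""
    c = tf.math.l2_normalize(context, axis=-1)           # unit-length context vectors
    inner = tf.einsum("btd,bd->bt", decoded, c)          # scalar projection per time step
    return inner[..., tf.newaxis] * c[:, tf.newaxis, :]  # vector projection, same shape as `decoded`
```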

### LSTM seq2seq

This is an adapted bi-directional LSTM encoder-decoder model with a self-attention mechanism learned from the encoding steps. The context-vectors are used to project the resulting decoded sequences before computing logits.
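For intuition, here is a minimal plain-Keras sketch of a bi-directional LSTM encoder with an additive self-attention pooling step that yields a context vector. It illustrates the general shape of the approach rather than this library's exact layers; the vocabulary and dimension values are placeholder assumptions.
```python
import tensorflow as tf
from tensorflow.keras import layers

tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)                    # integer token ids
x = layers.Embedding(input_dim=10_000, output_dim=128)(tokens)
states = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # (batch, time, 128) encoder states
scores = layers.Dense(1)(layers.Dense(128, activation="tanh")(states))    # additive attention scores
weights = layers.Softmax(axis=1)(scores)                                   # attention over time steps
context = tf.reduce_sum(weights * states, axis=1)                          # (batch, 128) context vector
encoder = tf.keras.Model(tokens, context)
```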


## Training
---

Both models are trained using Adam SGD with the learning-rate decay program in [[2](https://arxiv.org/abs/1706.03762)].

The pre-built auto-encoder models inherit from [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model), and as such they can be trained using the [fit method](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit). Training examples are available in the [examples folder](./examples/trainers); they use HuggingFace [tokenizers](https://huggingface.co/docs/tokenizers/python/latest/) and [datasets](https://huggingface.co/docs/datasets/master/).
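A rough sketch of what `fit`-based training can look like is below. The `TransformerAutoEncoder` constructor arguments shown (`max_sequence_len`, `embedding_size`, `token_hash`) are illustrative assumptions rather than the library's confirmed signature, and the sketch assumes the auto-encoder computes its reconstruction loss internally; the scripts in [examples/trainers](./examples/trainers) are the authoritative reference.
```python
import tensorflow as tf
from tokenizers import Tokenizer
from datasets import load_dataset
from text2vec.autoencoders import TransformerAutoEncoder

# Hypothetical setup: any trained `tokenizers.Tokenizer` and any text corpus will do.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
corpus = load_dataset("multi_news", split="train[:1%]")

def generate():
    for record in corpus:
        # space-join sub-word tokens so the model sees whitespace-delimited text
        yield " ".join(tokenizer.encode(record["document"]).tokens)

texts = tf.data.Dataset.from_generator(
    generate, output_signature=tf.TensorSpec(shape=(), dtype=tf.string)
).batch(64)

# Constructor arguments below are assumptions for illustration only.
model = TransformerAutoEncoder(max_sequence_len=512, embedding_size=128, token_hash=tokenizer.get_vocab())
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4))  # loss assumed to be added by the model
model.fit(texts, epochs=1)
```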

If you wish to run the example training scripts then you will need to clone the repository
```bash
git clone https://github.com/brainsqueeze/text2vec.git
```
and then run either
```bash
python -m examples.trainers.news_transformer
```
for the attention-based transformer, or
```bash
python -m examples.trainers.news_lstm
```
for the LSTM-based encoder. These examples use the [Multi-News](https://github.com/Alex-Fabbri/Multi-News) dataset via [HuggingFace](https://huggingface.co/datasets/multi_news).

If you have CUDA and cuDNN installed you can run `pip install -r requirements-gpu.txt`. The GPU will automatically be detected and used if present, otherwise it will fall back to the CPU for training and inferencing.

### Mutual contextual orthogonality

To impose quasi-mutual orthogonality on the learned context vectors simply add the `--orthogonal` flag to the training command. This will add a loss term that can be thought of as a Lagrange multiplier where the constraint is self-alignment of the context vectors, and orthogonality between non-self vectors. The aim is not to impose orthogonality between all text inputs that are not the same, but rather to coerce the model to learn significantly different encodings for different contextual inputs.
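One way to read that loss term is as a penalty that pushes a batch of context vectors toward an orthonormal set. The sketch below is an interpretation of that idea in plain TensorFlow, not necessarily the exact term the trainer uses.
```python
import tensorflow as tf

def orthogonality_penalty(context: tf.Tensor) -> tf.Tensor:
    """Penalty that is zero when the batch of context vectors (batch, dim)
    forms a set of mutually orthogonal unit vectors."""
    c = tf.math.l2_normalize(context, axis=-1)
    gram = tf.matmul(c, c, transpose_b=True)             # pairwise cosine similarities
    identity = tf.eye(tf.shape(c)[0], dtype=gram.dtype)  # self-alignment target
    return tf.reduce_mean(tf.square(gram - identity))    # off-diagonal terms are driven to zero
```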

## Python API

@@ -149,43 +72,43 @@

#### Auto-encoders

- [text2vec.autoencoders.TransformerAutoEncoder](/text2vec/autoencoders.py#L12)
- [text2vec.autoencoders.LstmAutoEncoder](/text2vec/models/transformer.py#L190)

#### Layers

- [text2vec.models.TransformerEncoder](/text2vec/models/transformer.py#L8)
- [text2vec.models.TransformerDecoder](/text2vec/models/transformer.py#L78)
- [text2vec.models.RecurrentEncoder](/text2vec/models/sequential.py#L9)
- [text2vec.models.RecurrentDecoder](/text2vec/models/sequential.py#L65)

#### Input and Word-Embeddings Components

- [text2vec.models.Tokenizer](/text2vec/models/components/text_inputs.py#L5)
- [text2vec.models.Embed](/text2vec/models/components/text_inputs.py#L36)
- [text2vec.models.TokenEmbed](/text2vec/models/components/text_inputs.py#L116)
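For orientation, the snippet below sketches the kind of text-to-ids-to-vectors pipeline these components cover, using stock Keras layers rather than the text2vec classes themselves; the vocabulary and sizes are made up for the example.
```python
import tensorflow as tf

# Toy vocabulary and sizes; TF 2.6+ exposes StringLookup directly under tf.keras.layers
# (earlier 2.x releases keep it under tf.keras.layers.experimental.preprocessing).
vocab = ["the", "movie", "was", "great", "terrible"]
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)  # token -> integer id, index 0 reserved for OOV
embed = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=8)

tokens = tf.strings.split(["the movie was great", "the movie was terrible"])  # ragged tokens per example
ids = lookup(tokens)                                                          # ragged integer ids
vectors = embed(ids.to_tensor())                                              # (batch, time, 8) padded embeddings
```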

#### Attention Components

- [text2vec.models.components.attention.ScaledDotAttention](/text2vec/models/components/attention.py#L7)
- [text2vec.models.components.attention.SingleHeadAttention](/text2vec/models/components/attention.py#L115)
- [text2vec.models.MultiHeadAttention](/text2vec/models/components/attention.py#L179)
- [text2vec.models.BahdanauAttention](/text2vec/models/components/attention.py#L57)
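For reference, the core operation behind these attention components is the scaled dot-product attention of [[2](https://arxiv.org/abs/1706.03762)]; a bare-bones version in plain TensorFlow (not the library's implementation) looks like this:
```python
import tensorflow as tf

def scaled_dot_attention(queries: tf.Tensor, keys: tf.Tensor, values: tf.Tensor) -> tf.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V for inputs shaped (batch, time, dim)."""
    d_k = tf.cast(tf.shape(keys)[-1], queries.dtype)
    scores = tf.matmul(queries, keys, transpose_b=True) / tf.sqrt(d_k)  # (batch, t_q, t_k)
    weights = tf.nn.softmax(scores, axis=-1)                            # attention distribution over keys
    return tf.matmul(weights, values)                                   # (batch, t_q, d_v)
```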

#### LSTM Components

- [text2vec.models.BidirectionalLSTM](/text2vec/models/components/recurrent.py#L5)

#### Pointwise Feedforward Components

- [text2vec.models.PositionWiseFFN](/text2vec/models/components/feed_forward.py#L4)

#### General Layer Components

- [text2vec.models.components.utils.LayerNorm](/text2vec/models/components/utils.py#L6)
- [text2vec.models.components.utils.TensorProjection](/text2vec/models/components/utils.py#L43)
- [text2vec.models.components.utils.PositionalEncder](/text2vec/models/components/utils.py#L77)
- [text2vec.models.components.utils.VariationPositionalEncoder](/text2vec/models/components/utils.py#L122)

#### Dataset Pre-processing

@@ -207,17 +130,17 @@
## Inference Demo
---

Trained text2vec models can be demonstrated from a lightweight app included in this repository. The demo runs extractive summarization from long bodies of text using the attention vectors of the encoding latent space. To get started, you will need to clone the repository and then install additional dependencies:
```bash
git clone https://github.com/brainsqueeze/text2vec.git
cd text2vec
pip install flask tornado
```
To start the model server, simply run
```bash
python demo/api.py --model_dir /absolute/saved_model/parent/dir
```
The `model_dir` CLI parameter must be an absolute path to the directory containing the `/saved_model` folder and the `tokenizer.json` file from a text2vec model with an `embed` signature. A demonstration app is served on port 9090.
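Once the server is up, the `/summarize` route can be exercised directly; the call below assumes a local instance on the default port 9090 and uses the third-party `requests` package.
```python
import json
import requests

article = (
    "First paragraph of a long article...\n\n"
    "Second paragraph with supporting details...\n\n"
    "Third paragraph wrapping things up."
)
resp = requests.post("http://localhost:9090/summarize", json={"text": article})
print(json.dumps(resp.json(), indent=2))  # {"data": [{"text": ..., "score": ...}], "message": "Success"}
```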

## References
---
Expand Down
124 changes: 124 additions & 0 deletions demo/api.py
@@ -0,0 +1,124 @@
from typing import List, Union
from math import pi
import argparse
import json
import re

from flask import Flask, request, Response, send_from_directory
from tornado.log import enable_pretty_logging
from tornado.httpserver import HTTPServer
from tornado.wsgi import WSGIContainer
from tornado.ioloop import IOLoop
import tornado.autoreload
import tornado

import tensorflow as tf
from tensorflow.keras import models, Model
from tokenizers import Tokenizer

app = Flask(__name__, static_url_path="", static_folder="./")
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, help="Directory containing serialized model and tokenizer", required=True)
args = parser.parse_args()

model: Model = models.load_model(f"{args.model_dir}/saved_model")
tokenizer: Tokenizer = Tokenizer.from_file(f"{args.model_dir}/tokenizer.json")


def responder(results, error, message):
    """Boilerplate Flask response item.

    Parameters
    ----------
    results : dict
        API response
    error : int
        Error code
    message : str
        Message to send to the client

    Returns
    -------
    flask.Response
    """

    assert isinstance(results, dict)
    results["message"] = message
    results = json.dumps(results, indent=2)

    return Response(
        response=results,
        status=error,
        mimetype="application/json"
    )


def tokenize(text: Union[str, List[str]]) -> List[str]:
    # Whitespace-join the sub-word tokens so inputs match the model's expected text format
    if isinstance(text, str):
        return [' '.join(tokenizer.encode(text).tokens)]
    return [' '.join(batch.tokens) for batch in tokenizer.encode_batch(text)]


def get_summaries(paragraphs: List[str]):
    # Embed each paragraph (batched) and the full document, then score paragraphs by the
    # angular similarity of their context vectors to the document vector
    context = tf.concat([
        model.embed(batch)["attention"]
        for batch in tf.data.Dataset.from_tensor_slices(paragraphs).batch(32)
    ], axis=0)
    doc_vector = model.embed(tf.strings.reduce_join(paragraphs, separator=' ', keepdims=True))["attention"]
    cosine = tf.tensordot(tf.math.l2_normalize(context, axis=1), tf.math.l2_normalize(doc_vector, axis=1), axes=[-1, 1])
    cosine = tf.clip_by_value(cosine, -1, 1)
    likelihoods = tf.nn.softmax(180 - tf.math.acos(cosine) * (180 / pi), axis=0)
    return likelihoods


@app.route("/")
def root():
return send_from_directory(directory="./html/", path="index.html")


@app.route("/summarize", methods=["GET", "POST"])
def summarize():
if request.is_json:
payload = request.json
else:
payload = request.values

text = payload.get("text", "")
if not text:
return responder(results={}, error=400, message="No text provided")

paragraphs = [p for p in re.split(r"\n{1,}", text) if p.strip()]
if len(paragraphs) < 2:
return responder(results={"text": paragraphs}, error=400, message="Insufficient amount of text provided")

tokenized = tokenize(paragraphs)
likelihoods = get_summaries(tokenized)
likelihoods = tf.squeeze(likelihoods)
cond = tf.where(likelihoods > tf.math.reduce_mean(likelihoods) + tf.math.reduce_std(likelihoods)).numpy().flatten()
output = [{
"text": paragraphs[idx],
"score": float(likelihoods[idx])
} for idx in cond]

results = {"data": output}
return responder(results=results, error=200, message="Success")


def serve(port: int = 9090, debug: bool = False):
    http_server = HTTPServer(WSGIContainer(app))
    http_server.listen(port)
    enable_pretty_logging()

    io_loop = IOLoop.current()
    if debug:
        tornado.autoreload.start(check_time=500)
    print("Listening to port", port, flush=True)

    try:
        io_loop.start()
    except KeyboardInterrupt:
        pass


if __name__ == '__main__':
    serve()
