From 79c484b114f63529df6b263e2fe5f10d708988b4 Mon Sep 17 00:00:00 2001 From: Phil Weir Date: Wed, 1 Jul 2020 08:25:31 +0000 Subject: [PATCH] magnum of opera notes --- 022-burette/A Magnum of Opera.ipynb | 1011 +++++++++++++++++++++++++++ 1 file changed, 1011 insertions(+) create mode 100644 022-burette/A Magnum of Opera.ipynb diff --git a/022-burette/A Magnum of Opera.ipynb b/022-burette/A Magnum of Opera.ipynb new file mode 100644 index 0000000..3523fb0 --- /dev/null +++ b/022-burette/A Magnum of Opera.ipynb @@ -0,0 +1,1011 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "source": [ + "# A Magnum of Opera (aka \"A Lot of Work\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(or opuses, or opi, take your pick)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Previously, we focused on Flask and SQLAlchemy - we evolved from our Latency approach, better structuring our project, and leveraging tooling to do so in the Python ecosystem." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rather than accelerating too quickly, I have kept most of the code the same as previously, but focused on a practical example of dependency injection. Firstly, to show how code can be decoupled, in this case with `injector`, and how that can give benefits to separation of concerns and those original DDD ideas of separated layers: domain, application and persistence." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Decorating the Lab" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To demonstrate the power of decoupling - with or without an Inversion of Control container like `injector`'s - we will briefly rewrite the entire stack. Even more, we will switch to a completely different coding paradigm: coroutines. Our domain model and its tests will remain unchanged." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This requires one more Python construct that we have brushed over several times - `decorators`. Note that this is distinct from the _decorator pattern_ , which, in Python, links more to mixins and the wrapper pattern we looked at in previous weeks (reminder: `class MyDict: _real_dict = {}`)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How do decorators work? They take a function or method, and return a tweaked version. Perhaps this means a side-effect, such as recording an action, or a pure mutation, such as adjusting passed arguments or return values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "They can be specified in a couple of ways:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "class StringifyResult:\n", + " # Called when Python hits the decorator in code\n", + " def __init__(self, f):\n", + " self._f = f\n", + " \n", + " # Called when you call the function\n", + " def __call__(self, *args):\n", + " original_result = self._f(*args)\n", + " return str(original_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "@StringifyResult\n", + "def add_numbers(a, b):\n", + " return a + b" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'4'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "add_numbers(1, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A more succint and common, but slightly less intuitive, way of writing this is as a function that returns a function:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "def stringify_result(f):\n", + " # I've given args a and b for clarity, but *args, **kwargs\n", + " # is more common and flexible\n", + " def tweaked_f(a, b):\n", + " original_result = f(a, b)\n", + " return str(original_result)\n", + " \n", + " return tweaked_f" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "@stringify_result\n", + "def add_numbers(a, b):\n", + " return a + b" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'4'" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "add_numbers(1, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We could do an entire session, or series of sessions, or [meetup series](https://twitter.com/belfastfp) on functional programming, but we'll suffice with a brief aside to whet the appetite." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Functional Works" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Functional programming can get very theoretical - there are excellent intros to the [computing theory](https://github.com/hmemcpy/milewski-ctfp-pdf) that give a whole new way about thinking about programming. For the moment:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* functional programming focuses on functions, rather than objects\n", + "* state is local, not global or maintained - values are passed into functions, and returned from them\n", + "* functions can be passed all over the show\n", + "* it focuses on calling functions, nesting right down, rather than long sequences of commands (imperative programming)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python, we rarely take an exclusively functional approach, but venerable languages such as Haskell, Scala, Lisp, Clojure, (and dozens of new ones known only in fashionable coffee shops in Shoreditch) can be very or entirely functional." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, it's important to be aware of - it's the theoretical basis of MapReduce and massively scalable code. If state is not shared, globally or at a class/object level, parallelisation is trivial." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'c': 4, 'a': 4, 'b': 8}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "class Sequence:\n", + " content = 'abcbabbbcbabccba'\n", + " def count_distinct(self):\n", + " distinct = set(self.content)\n", + " return {character: self.content.count(character) for character in distinct}\n", + "\n", + "my_seq = Sequence()\n", + "my_seq.count_distinct()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "vs" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'c': 4, 'a': 4, 'b': 8}" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def count_distinct(content):\n", + " distinct = set(content)\n", + " return {character: content.count(character) for character in distinct}\n", + "\n", + "my_seq = 'abcbabbbcbabccba'\n", + "count_distinct(my_seq)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why is the second helpful?" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DISTINCTS: [{'c': 1, 'a': 1, 'b': 2}, {'a': 1, 'b': 3}, {'c': 1, 'a': 1, 'b': 2}, {'c': 2, 'a': 1, 'b': 1}]\n" + ] + }, + { + "data": { + "text/plain": [ + "{'c': 4, 'a': 4, 'b': 8}" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import textwrap\n", + "\n", + "def split_content_to_substrings(content):\n", + " return textwrap.fill(content, 4).split()\n", + "\n", + "def compose_distinct(distincts):\n", + " print('DISTINCTS:', list(distincts))\n", + " \n", + " # Work out all occurring characters\n", + " all_characters = set(sum([list(d.keys()) for d in distincts], []))\n", + " \n", + " counts = {c: 0 for c in all_characters}\n", + " # Less functional - this could be map/comprehension, but easier to read!\n", + " for distinct in distincts:\n", + " for c, ct in distinct.items():\n", + " counts[c] += ct\n", + " \n", + " return counts\n", + "\n", + "# a series of composed functions\n", + "compose_distinct(list(map(count_distinct, split_content_to_substrings(my_seq))))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unlike the first, this could be immediately parallelized." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### More Painting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Many decorators exist:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* `@classmethod`: indicates a class method (first arg is the class, not `self`)\n", + "* `@staticmethod`: indicates a class method (no first arg)\n", + "* `@timeit`: indicates you want this function timed (recall timeit from the Latency exercises)\n", + "* `@contextmanager`: preps a function for use in a with statement\n", + "* `@property`: indicates that you want to expose a getter&/setter as an object property\n", + "* `@pytest.fixture`: indicates that the function it applies to defined an injectable fixture\n", + "* `@app.route`: indicates that the following function is a route-handler in Flask" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note from the last one, that decorators can be written as object methods, as well as functions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Coroutines" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Coroutines in Python started out as decorators - @asyncio.coroutine - but this approach is now deprecated and you will generally see the syntax:" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "async def my_asynchronous_function(a, b):\n", + " return a + b" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This should look like any other function, except the `async`. And, in fact, it really is - except that:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "my_asynchronous_function(1, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It returns a coroutine. This is a construct that can be scheduled, and may have a result, but not immediately. They can be chained. Again, we'll look at practical examples rather than delving into coroutine theory. However, coroutines work on the principle of an event loop that ensures they actually happen." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import asyncio\n", + "\n", + "# This approach is required if no event loop yet exists - but in Jupyter it does\n", + "# asyncio.run(my_asynchronous_function)\n", + "\n", + "await my_asynchronous_function(1, 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `await` syntax allows us to compose these like ordinary functions - but if some IO, sleep or such is taking time, Python can carry on executing other coroutines that have been scheduled, and switch between them, returning control to the first one when it's blocking step is done." + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Doing\n", + "3\n", + "Done\n" + ] + } + ], + "source": [ + "async def my_slow_function():\n", + " print('Doing')\n", + " await asyncio.sleep(3)\n", + " print('Done')\n", + " \n", + "loop = asyncio.get_event_loop()\n", + "result = loop.create_task(my_slow_function())\n", + "\n", + "await asyncio.sleep(1)\n", + "other_result = await my_asynchronous_function(1, 2)\n", + "print(other_result)\n", + "await result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "An example of where this is useful occurs in HTTP responses..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Opera on Tour" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have explored the Magnum Opus server in a number of different ways, and seen how we can separate business logic from our application." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our main exercise will be to work through the new code, and to adapt it for a new type of database." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Raising the roof" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First a few extra dependencies:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: findspark in /opt/conda/lib/python3.7/site-packages (1.4.2)\n", + "Requirement already satisfied: cassandra-driver in /opt/conda/lib/python3.7/site-packages (3.24.0)\n", + "Requirement already satisfied: flask-cqlalchemy in /opt/conda/lib/python3.7/site-packages (1.2.0)\n", + "Collecting flask\n", + " Downloading Flask-1.1.2-py2.py3-none-any.whl (94 kB)\n", + "\u001b[K |████████████████████████████████| 94 kB 2.7 MB/s eta 0:00:011\n", + "\u001b[?25hRequirement already satisfied: geomet<0.3,>=0.1 in /opt/conda/lib/python3.7/site-packages (from cassandra-driver) (0.2.1.post1)\n", + "Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.7/site-packages (from cassandra-driver) (1.15.0)\n", + "Requirement already satisfied: blist in /opt/conda/lib/python3.7/site-packages (from flask-cqlalchemy) (1.3.6)\n", + "Collecting itsdangerous>=0.24\n", + " Downloading itsdangerous-1.1.0-py2.py3-none-any.whl (16 kB)\n", + "Requirement already satisfied: Jinja2>=2.10.1 in /opt/conda/lib/python3.7/site-packages (from flask) (2.11.2)\n", + "Requirement already satisfied: click>=5.1 in /opt/conda/lib/python3.7/site-packages (from flask) (7.1.2)\n", + "Collecting Werkzeug>=0.15\n", + " Downloading Werkzeug-1.0.1-py2.py3-none-any.whl (298 kB)\n", + "\u001b[K |████████████████████████████████| 298 kB 7.3 MB/s eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.7/site-packages (from Jinja2>=2.10.1->flask) (1.1.1)\n", + "Installing collected packages: itsdangerous, Werkzeug, flask\n", + "Successfully installed Werkzeug-1.0.1 flask-1.1.2 itsdangerous-1.1.0\n" + ] + } + ], + "source": [ + "!pip3 install findspark cassandra-driver flask-cqlalchemy flask" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " % Total % Received % Xferd Average Speed Time Time Time Current\n", + " Dload Upload Total Spent Left Speed\n", + " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\n", + "100 8340k 100 8340k 0 0 4398k 0 0:00:01 0:00:01 --:--:-- 7385k\n" + ] + } + ], + "source": [ + "!curl -L -O http://dl.bintray.com/spark-packages/maven/datastax/spark-cassandra-connector/2.4.0-s_2.11/spark-cassandra-connector-2.4.0-s_2.11.jar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Cassandra is a highly scalable NoSQL database - we have one conveniently set up." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "CASSANDRA_PASSWORD = \"Le2V2gZ0nk\"" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "import os\n", + "import findspark\n", + "\n", + "os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'\n", + "os.environ['CQLENG_ALLOW_SCHEMA_MANAGEMENT'] = '1'\n", + "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2 --conf spark.cassandra.connection.host=cassandra-0 pyspark-shell'\n", + "findspark.init()\n", + "\n", + "import pyspark" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "conf = pyspark.SparkConf()\n", + " \n", + "conf.set(\"spark.jars\", \"./spark-cassandra-connector-2.4.0-s_2.11.jar\") # spark-cassandra-connect jar\n", + " \n", + "conf.set(\"spark.cassandra.connection.host\", \"cassandra\")\n", + "conf.set(\"spark.cassandra.auth.username\", \"cassandra\")\n", + "conf.set(\"spark.cassandra.auth.password\", CASSANDRA_PASSWORD)\n", + "\n", + "\n", + "sc = pyspark.SparkContext(conf=conf)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I don't generally use standard examples, as you'll see them anyway, but this is particularly nice example of how stats and parallelism can let you calculate `pi`:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3.14152816\n" + ] + } + ], + "source": [ + "import random\n", + "\n", + "num_samples = 100000000\n", + "\n", + "def inside(p): \n", + " x, y = random.random(), random.random()\n", + " return x*x + y*y < 1\n", + "\n", + "count = sc.parallelize(range(0, num_samples)).filter(inside).count()\n", + "pi = 4 * count / num_samples\n", + "print(pi)\n", + "sc.stop()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use spark to handle our data that has been inserted into Cassandra." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "from pyspark.sql import SQLContext\n", + "sql_context = SQLContext(sc)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "table_df = sql_context.read\\\n", + " .format(\"org.apache.spark.sql.cassandra\")\\\n", + " .options(table='substance', keyspace='pythoncourse')\\\n", + " .load()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A few hints" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Helpfully, someone created the `flask_cqlalchemy` module to make it easier (not trivial) to use SQLAlchemy style code with Cassandra." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "from cassandra.auth import PlainTextAuthProvider" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "import uuid\n", + "from flask import Flask\n", + "from flask_cqlalchemy import CQLAlchemy\n", + "\n", + "app = Flask(__name__)\n", + "app.config['CASSANDRA_HOSTS'] = ['cassandra']\n", + "app.config['CASSANDRA_KEYSPACE'] = \"pythoncourse\"\n", + "app.config['CASSANDRA_SETUP_KWARGS'] = {'protocol_version': 3, \"auth_provider\": PlainTextAuthProvider(\n", + " username='cassandra', password=CASSANDRA_PASSWORD)}\n", + "db = CQLAlchemy(app)\n", + "\n", + "\n", + "class Substance(db.Model):\n", + " id = db.columns.UUID(primary_key=True, default=uuid.uuid4)\n", + " nature = db.columns.Text()\n", + " state = db.columns.List(db.columns.Text())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [], + "source": [ + "#db.sync_db()" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "522b6440-bdee-4d82-b531-030edb4248a3 Sulphur\n", + "4d76c633-26bb-4ed8-be44-558440d5d7e5 Sulphur\n", + "b6d2e363-30a8-4e5e-92a3-8c7c8391b060 Sulphur\n", + "f7598b0a-1050-4a5e-96be-7785bd8b2f27 Sulphur\n", + "1f1a941b-019a-4f4f-acfd-09968f6e9d03 Sulphur\n", + "a227e5fc-6b81-4a54-a421-183a361368d1 Sulphur\n" + ] + } + ], + "source": [ + "for substance in Substance.objects().all():\n", + " print(substance.id, substance.nature)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What can we do with this? Looking back at our earlier work on counting distinct perhaps..." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "hideCode": false, + "hidePrompt": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
naturecount
0Sulphur6
\n", + "
" + ], + "text/plain": [ + " nature count\n", + "0 Sulphur 6" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table_df.groupby('nature').count().toPandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
distinct_natures
01
\n", + "
" + ], + "text/plain": [ + " distinct_natures\n", + "0 1" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from pyspark.sql.functions import approx_count_distinct, countDistinct\n", + "nature_counts = table_df.agg(approx_count_distinct(table_df.nature).alias('distinct_natures')).collect()\n", + "sql_context.createDataFrame(nature_counts).toPandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
distinct_natures
01
\n", + "
" + ], + "text/plain": [ + " distinct_natures\n", + "0 1" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "nature_counts = table_df.agg(approx_count_distinct(table_df.nature).alias('distinct_natures')).collect()\n", + "sql_context.createDataFrame(nature_counts).toPandas()" + ] + } + ], + "metadata": { + "hide_code_all_hidden": false, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}