Finished updating lingress notebook
wrmthorne committed Mar 13, 2023
1 parent ade4192 commit 625d2e1
Showing 2 changed files with 185 additions and 437 deletions.
155 changes: 43 additions & 112 deletions Advanced 1 - Linear Regression/linear_regression_notebook.ipynb
@@ -27,7 +27,7 @@
"outputs": [],
"source": [
"import sys\n",
"!wget https://raw.githubusercontent.com/wrmthorne/linear-regression/main/requirements.txt\n",
"!wget https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/requirements.txt\n",
"!{sys.executable} -m pip install -r requirements.txt"
]
},
@@ -105,7 +105,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -123,7 +123,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -167,7 +167,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -250,7 +250,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/olympic_data.csv', encoding='unicode_escape')\n",
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/olympic_data.csv', encoding='unicode_escape')\n",
"\n",
"# Print some information about the dataframe\n",
"print(df.info())"
@@ -273,7 +273,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -431,7 +431,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -451,7 +451,7 @@
"metadata": {},
"outputs": [],
"source": [
"xlsx = pd.ExcelFile('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/experiment_data.xlsx')\n",
"xlsx = pd.ExcelFile('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/experiment_data.xlsx')\n",
"sheet_names = xlsx.sheet_names\n",
"\n",
"run1 = pd.read_excel(xlsx, sheet_names[0], index_col=0)\n",
@@ -477,7 +477,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -537,7 +537,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -557,7 +557,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -593,7 +593,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/olympic_data.csv', encoding='unicode_escape')\n",
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/olympic_data.csv', encoding='unicode_escape')\n",
"data = df[['Year', 'Result']].loc[df.Event == '100M Women'].mask(df.eq('None')).dropna()\n",
"data = data.sort_values(by=['Year'])\n",
"\n",
@@ -661,10 +661,23 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Sample answer\n",
"X_basis = polynomial(x, num_basis=len(x))\n",
"lin_reg = linear_model.LinearRegression()\n",
"lin_reg.fit(X_basis, y)\n",
"y_pred = lin_reg.predict(X_basis)\n",
"\n",
"plt.plot(x, y, 'rx', label='Data')\n",
"plt.plot(x, y_pred, label='Prediction')\n",
"plt.title('Medal Times for Women\\'s Olympic 100m')\n",
"plt.xlabel('Year')\n",
"plt.ylabel('Time (Seconds)')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -680,104 +693,15 @@
"\n",
"There are two ways of minimising the loss w.r.t $m$ and $c$. It can be solved using linear algebra (the method which will be covered here) or it can be performed iteratively, that is, by updating the values by a incremental amount ([learning rate](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/)) at each iteration of a loop. This is known as [gradient descent](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21) and is the primary algorithm used on extremely large models where there is too much data to store in memory at any one time to solve algebraically.\n",
"\n",
"All standard models of linear regression will make use of matrix multiplication. If you haven't used matrices before, [khan academy](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:mat-intro/v/introduction-to-the-matrix) has a really good series on them. We will first cover how to use matrices in numpy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define matrix A and matrix B (2D numpy arrays)\n",
"A = np.random.randint(1, 5, size=(4, 4))\n",
"B = np.random.randint(1, 5, size=(4, 4))\n",
"\n",
"print(A, end='\\n\\n')\n",
"print(B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dot and outer products of matrices can be performed in numpy simply. Dot product can be performed using `np.dot()` (or `@` but it is very [slightly different](https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication) in some cases) and the outer product can be calculated with `np.outer()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# np.dot() and @ are basically equivalent\n",
"print(np.dot(A, B), end='\\n\\n')\n",
"print(A @ B, end='\\n\\n')\n",
"\n",
"print(np.outer(A, B))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matrices can have their inverse inverse and transpose applied"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Transposition\n",
"print(A, end='\\n\\n')\n",
"print(A.T, end='\\n\\n')\n",
"\n",
"# Inversion\n",
"print(np.linalg.inv(A))"
"All standard models of linear regression will make use of matrix multiplication. If you haven't used matrices before, [khan academy](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:mat-intro/v/introduction-to-the-matrix) has a really good series on them. If you need a recap on how to use matrices in NumPy, we covered them in our [numpy workshop](https://github.com/wrmthorne/SWiCS-WIE-Intermediate-Python-Worksheets/tree/main/Intermediate%201%20-%20Numpy%20and%20Plotting)."
]
},
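The notebook goes on to cover the linear-algebra solution, but as a rough illustration of the iterative alternative described above, here is a minimal gradient descent sketch on synthetic data. Everything in it (the toy data, `learning_rate`, the number of iterations) is an assumption for illustration, not part of the notebook's own solution.

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 (for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

m, c = 0.0, 0.0       # initial guesses for gradient and intercept
learning_rate = 0.01  # size of each incremental update
n = len(x)

for _ in range(2000):
    error = (m * x + c) - y
    # Gradients of the mean squared error with respect to m and c
    grad_m = (2 / n) * np.sum(error * x)
    grad_c = (2 / n) * np.sum(error)
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f'm = {m:.3f}, c = {c:.3f}')  # should end up close to 2 and 1
```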
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now show how summation across a formula is equivalent to matrix multiplication when use in a specific way. If we define a really large vector C and we want to find the sum of the sqares of all elements in C, the same operation can be achieved much faster using matrix multiplication:\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n}c_i^2 = C^T \\cdot C\n",
"$$\n",
"\n",
"This difference may not seem important for a simple calculation like this but by using matrix multiplication, we avoid iteration which is a major bottle neck in computation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reduce the size if this cell takes more than a couple of seconds\n",
"C = np.random.randint(1, 5, size=10000000)\n",
"\n",
"# Summation \n",
"start_time = time.time()\n",
"summation = sum(C**2)\n",
"print(f'Summation: {time.time() - start_time:.4f}s')\n",
"\n",
"# Matrix multiplication\n",
"start_time = time.time()\n",
"mat_mul = np.dot(C.T, C)\n",
"print(f'Matrix Multiplication: {time.time() - start_time:.4f}s')\n",
"\n",
"print(f'Equivalent?: {summation == mat_mul}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have gone through the basics of matrices in numpy, we can start to look at how we can apply this to linear regression. First, we need to understand how we can convert objective function into a vectorised form. We can take our original objective function and stack the two parameters into a weight matrix $\\mathbf{w}$:\n",
"First, we need to understand how we can convert objective function into a vectorised form. We can take our original objective function and stack the two parameters into a weight matrix $\\mathbf{w}$:\n",
"\n",
"$$\n",
"\\mathbf{w} = \\begin{bmatrix} c \\\\ m \\end{bmatrix}\n",
@@ -852,6 +776,10 @@
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(42)\n",
"A = np.random.randint(0, 10, (4, 4))\n",
"B = np.random.randint(0, 10, (4, 4))\n",
"\n",
"# Automatic linear algebra solving\n",
"w = np.linalg.solve(A, B)\n",
"\n",
@@ -1011,12 +939,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making a Prediction\n",
"\n",
"In 2020, E. Thompson-Herah won the women's 100m with a time of 10.61s. Let's see how our predictions line up with that"
"In 2020, E. Thompson-Herah won the women's 100m with a time of 10.61s. Let's see how our predictions line up with that. Although the error of basis 3 is higher, by the next olympics, a pure linear prediction will be far more inaccurate. In about 100 years, the world record sprint time will drop below 0s which can't happen."
]
},
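The notebook's own prediction cell is collapsed in this diff, but a minimal sketch of the comparison might look like the following. It assumes the `polynomial` basis helper, the `x`/`y` arrays and the `linear_model` import from the earlier cells; the exact call signature of `polynomial` is inferred from the sample answer above, so treat this as illustrative rather than as the notebook's own code.

```python
# Hypothetical sketch, not the notebook's own cell: refit with 3 basis functions
# and compare the 2020 prediction against the actual winning time of 10.61s.
actual_2020 = 10.61

X_basis_3 = polynomial(x, num_basis=3)                # assumes `polynomial`, `x`, `y` exist
lin_reg_3 = linear_model.LinearRegression().fit(X_basis_3, y)

X_2020 = polynomial(np.array([2020]), num_basis=3)    # same basis for the query year
predicted_2020 = lin_reg_3.predict(X_2020).ravel()[0]

print(f'Predicted: {predicted_2020:.2f}s  Actual: {actual_2020:.2f}s')
print(f'Absolute error: {abs(predicted_2020 - actual_2020):.2f}s')
```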
{
@@ -1040,11 +969,8 @@
}
],
"metadata": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
},
"kernelspec": {
"display_name": "Python 3.8.10 64-bit",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -1058,9 +984,14 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.10.6"
},
"orig_nbformat": 4
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2