Finished updating lingress notebook
wrmthorne committed Mar 13, 2023
1 parent ade4192 commit 625d2e1
Showing 2 changed files with 185 additions and 437 deletions.
155 changes: 43 additions & 112 deletions Advanced 1 - Linear Regression/linear_regression_notebook.ipynb
@@ -27,7 +27,7 @@
"outputs": [],
"source": [
"import sys\n",
"!wget https://raw.githubusercontent.com/wrmthorne/linear-regression/main/requirements.txt\n",
"!wget https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/requirements.txt\n",
"!{sys.executable} -m pip install -r requirements.txt"
]
},
@@ -105,7 +105,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -123,7 +123,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -167,7 +167,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -250,7 +250,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/olympic_data.csv', encoding='unicode_escape')\n",
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/olympic_data.csv', encoding='unicode_escape')\n",
"\n",
"# Print some information about the dataframe\n",
"print(df.info())"
@@ -273,7 +273,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -431,7 +431,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -451,7 +451,7 @@
"metadata": {},
"outputs": [],
"source": [
"xlsx = pd.ExcelFile('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/experiment_data.xlsx')\n",
"xlsx = pd.ExcelFile('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/experiment_data.xlsx')\n",
"sheet_names = xlsx.sheet_names\n",
"\n",
"run1 = pd.read_excel(xlsx, sheet_names[0], index_col=0)\n",
@@ -477,7 +477,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -537,7 +537,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -557,7 +557,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Answer Here"
]
},
{
@@ -593,7 +593,7 @@
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/linear-regression/main/olympic_data.csv', encoding='unicode_escape')\n",
"df = pd.read_csv('https://raw.githubusercontent.com/wrmthorne/SWiCS-WIE-Advanced-Python-Worksheets/main/Advanced%201%20-%20Linear%20Regression/olympic_data.csv', encoding='unicode_escape')\n",
"data = df[['Year', 'Result']].loc[df.Event == '100M Women'].mask(df.eq('None')).dropna()\n",
"data = data.sort_values(by=['Year'])\n",
"\n",
@@ -661,10 +661,23 @@
"metadata": {},
"outputs": [],
"source": [
"# Answer here"
"# Sample answer\n",
"X_basis = polynomial(x, num_basis=len(x))\n",
"lin_reg = linear_model.LinearRegression()\n",
"lin_reg.fit(X_basis, y)\n",
"y_pred = lin_reg.predict(X_basis)\n",
"\n",
"plt.plot(x, y, 'rx', label='Data')\n",
"plt.plot(x, y_pred, label='Prediction')\n",
"plt.title('Medal Times for Women\\'s Olympic 100m')\n",
"plt.xlabel('Year')\n",
"plt.ylabel('Time (Seconds)')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -680,104 +693,15 @@
"\n",
"There are two ways of minimising the loss w.r.t $m$ and $c$. It can be solved using linear algebra (the method which will be covered here) or it can be performed iteratively, that is, by updating the values by a incremental amount ([learning rate](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/)) at each iteration of a loop. This is known as [gradient descent](https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21) and is the primary algorithm used on extremely large models where there is too much data to store in memory at any one time to solve algebraically.\n",
"\n",
"All standard models of linear regression will make use of matrix multiplication. If you haven't used matrices before, [khan academy](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:mat-intro/v/introduction-to-the-matrix) has a really good series on them. We will first cover how to use matrices in numpy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define matrix A and matrix B (2D numpy arrays)\n",
"A = np.random.randint(1, 5, size=(4, 4))\n",
"B = np.random.randint(1, 5, size=(4, 4))\n",
"\n",
"print(A, end='\\n\\n')\n",
"print(B)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dot and outer products of matrices can be performed in numpy simply. Dot product can be performed using `np.dot()` (or `@` but it is very [slightly different](https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication) in some cases) and the outer product can be calculated with `np.outer()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# np.dot() and @ are basically equivalent\n",
"print(np.dot(A, B), end='\\n\\n')\n",
"print(A @ B, end='\\n\\n')\n",
"\n",
"print(np.outer(A, B))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matrices can have their inverse inverse and transpose applied"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Transposition\n",
"print(A, end='\\n\\n')\n",
"print(A.T, end='\\n\\n')\n",
"\n",
"# Inversion\n",
"print(np.linalg.inv(A))"
"All standard models of linear regression will make use of matrix multiplication. If you haven't used matrices before, [khan academy](https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:matrices/x9e81a4f98389efdf:mat-intro/v/introduction-to-the-matrix) has a really good series on them. If you need a recap on how to use matrices in NumPy, we covered them in our [numpy workshop](https://github.com/wrmthorne/SWiCS-WIE-Intermediate-Python-Worksheets/tree/main/Intermediate%201%20-%20Numpy%20and%20Plotting)."
]
},
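The notebook goes on to cover the linear-algebra solution, but as a rough illustration of the iterative alternative described above, here is a minimal gradient descent sketch on synthetic data. Everything in it (the toy data, `learning_rate`, the number of iterations) is an assumption for illustration, not part of the notebook's own solution.

```python
import numpy as np

# Synthetic data roughly following y = 2x + 1 (for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

m, c = 0.0, 0.0       # initial guesses for gradient and intercept
learning_rate = 0.01  # size of each incremental update
n = len(x)

for _ in range(2000):
    error = (m * x + c) - y
    # Gradients of the mean squared error with respect to m and c
    grad_m = (2 / n) * np.sum(error * x)
    grad_c = (2 / n) * np.sum(error)
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f'm = {m:.3f}, c = {c:.3f}')  # should end up close to 2 and 1
```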
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now show how summation across a formula is equivalent to matrix multiplication when use in a specific way. If we define a really large vector C and we want to find the sum of the sqares of all elements in C, the same operation can be achieved much faster using matrix multiplication:\n",
"\n",
"$$\n",
"\\sum_{i=1}^{n}c_i^2 = C^T \\cdot C\n",
"$$\n",
"\n",
"This difference may not seem important for a simple calculation like this but by using matrix multiplication, we avoid iteration which is a major bottle neck in computation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reduce the size if this cell takes more than a couple of seconds\n",
"C = np.random.randint(1, 5, size=10000000)\n",
"\n",
"# Summation \n",
"start_time = time.time()\n",
"summation = sum(C**2)\n",
"print(f'Summation: {time.time() - start_time:.4f}s')\n",
"\n",
"# Matrix multiplication\n",
"start_time = time.time()\n",
"mat_mul = np.dot(C.T, C)\n",
"print(f'Matrix Multiplication: {time.time() - start_time:.4f}s')\n",
"\n",
"print(f'Equivalent?: {summation == mat_mul}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have gone through the basics of matrices in numpy, we can start to look at how we can apply this to linear regression. First, we need to understand how we can convert objective function into a vectorised form. We can take our original objective function and stack the two parameters into a weight matrix $\\mathbf{w}$:\n",
"First, we need to understand how we can convert objective function into a vectorised form. We can take our original objective function and stack the two parameters into a weight matrix $\\mathbf{w}$:\n",
"\n",
"$$\n",
"\\mathbf{w} = \\begin{bmatrix} c \\\\ m \\end{bmatrix}\n",
@@ -852,6 +776,10 @@
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(42)\n",
"A = np.random.randint(0, 10, (4, 4))\n",
"B = np.random.randint(0, 10, (4, 4))\n",
"\n",
"# Automatic linear algebra solving\n",
"w = np.linalg.solve(A, B)\n",
"\n",
@@ -1011,12 +939,13 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making a Prediction\n",
"\n",
"In 2020, E. Thompson-Herah won the women's 100m with a time of 10.61s. Let's see how our predictions line up with that"
"In 2020, E. Thompson-Herah won the women's 100m with a time of 10.61s. Let's see how our predictions line up with that. Although the error of basis 3 is higher, by the next olympics, a pure linear prediction will be far more inaccurate. In about 100 years, the world record sprint time will drop below 0s which can't happen."
]
},
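The notebook's own prediction cell is collapsed in this diff, but a minimal sketch of the comparison might look like the following. It assumes the `polynomial` basis helper, the `x`/`y` arrays and the `linear_model` import from the earlier cells; the exact call signature of `polynomial` is inferred from the sample answer above, so treat this as illustrative rather than as the notebook's own code.

```python
# Hypothetical sketch, not the notebook's own cell: refit with 3 basis functions
# and compare the 2020 prediction against the actual winning time of 10.61s.
actual_2020 = 10.61

X_basis_3 = polynomial(x, num_basis=3)                # assumes `polynomial`, `x`, `y` exist
lin_reg_3 = linear_model.LinearRegression().fit(X_basis_3, y)

X_2020 = polynomial(np.array([2020]), num_basis=3)    # same basis for the query year
predicted_2020 = lin_reg_3.predict(X_2020).ravel()[0]

print(f'Predicted: {predicted_2020:.2f}s  Actual: {actual_2020:.2f}s')
print(f'Absolute error: {abs(predicted_2020 - actual_2020):.2f}s')
```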
{
@@ -1040,11 +969,8 @@
}
],
"metadata": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
},
"kernelspec": {
"display_name": "Python 3.8.10 64-bit",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -1058,9 +984,14 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.10.6"
},
"orig_nbformat": 4
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2