Commit 75bf912: differences for PR #32

actions-user committed Nov 22, 2024
1 parent a7909ce commit 75bf912

Showing 2 changed files with 139 additions and 100 deletions.

237 changes: 138 additions & 99 deletions 2-code-generation-optimization.md

@@ -233,95 +233,135 @@ The following shortcuts can be used to speed up your workflow:

## Hands-on Practice

In the following exercises, you will have the opportunity to practice using Codeium's Command, Chat, and Autocomplete features to generate, optimize, and refactor code. Create a Python file (for example `exercise.py`) in your IDE and follow along with the exercises.

::::::::::::::::::::::::::::::::::::: callout

### Jupyter Notebooks (Not Recommended)

It is also possible to use Codeium in Jupyter Notebooks, but for the best experience, it is recommended to use Jupyter Lab after installing the officially provided [Codeium extension for JupyterLab](https://codeium.com/jupyter_tutorial).

While it is possible to use Codeium in Jupyter Notebooks directly within VS Code, the experience may not be as smooth as in standard Python files. In particular, Windows users may encounter issues with some of the Codeium shortcuts and features.

:::::::::::::::::::::::::::::::::::::

### Code Generation

Let's start by exploring the Command mode and generating code snippets to analyze a dataset. With the Python file open, press `⌘(Command)+I` on Mac or `Ctrl+I` on Windows/Linux to open the Command prompt. Then, copy and paste the following text into it (you can also break it down into smaller pieces if you prefer):

```output
Load a [CO2 concentration dataset](https://datahub.io/core/co2-ppm/) from the file `co2-mm-mlo.csv` into a Pandas DataFrame, then generate descriptive statistics and visualize data distributions. Read the dataset using the following URL: https://edu.nl/k6v7x.
1. Write a function that takes a DataFrame as input and calculates key descriptive statistics, including:
- Number of rows and columns
- Data types of each column
- Summary statistics (e.g., mean, minimum, maximum)
Compute the statistics only for the numeric columns.
2. Write a function that accepts a DataFrame and a specific column as inputs. If the column is numeric (e.g., `int64`, `float64`), create a histogram to display its distribution; if categorical, create a bar plot to show category frequencies.
3. Write a function to plot the `Average` and `Interpolated` columns on a single graph, with `Date` on the x-axis, to visualize their distributions over time.
4. In the main, print nicely the information computed in 1., run the function defined in 2. on all columns, and run the function defined in 3. Use the `show()` functionality to display the figures only at the end of the main.
```

Here is what you would expect to see in the generated code:

```python
import matplotlib.pyplot as plt
import pandas as pd


def get_descriptive_stats(df):
    """
    Calculate key descriptive statistics for a given DataFrame.
    """
    stats = {"nrow": df.shape[0], "ncol": df.shape[1]}
    for col in df.select_dtypes(include="number").columns:
        stats[col] = {
            "dtype": df[col].dtype,
            "mean": df[col].mean(),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return stats


def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        df[column].plot.hist(ax=ax, bins=50)
    else:
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)


def plot_time_series(df):
    """
    Plot the Average and Interpolated columns over time.
    """
    fig, ax = plt.subplots()
    df.plot(x="Date", y=["Average", "Interpolated"], ax=ax)


def main():
    url = "https://edu.nl/k6v7x"
    df = pd.read_csv(url)
    stats = get_descriptive_stats(df)
    print(pd.DataFrame(stats).T)
    for col in df.columns:
        plot_distribution(df, col)
    plot_time_series(df)
    plt.show()


if __name__ == "__main__":
    main()
```

There is something wrong here, can you spot it? We will address this issue later in the "Bug Fixing" exercise, so keep it in mind as you proceed.
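As an aside, the `dtype.kind in "bifc"` test in the generated code relies on NumPy's single-character type codes: `b` (boolean), `i` (signed integer), `f` (floating point), and `c` (complex). A small sketch with toy data (invented values, purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Average": [316.91, 317.64], "Unit": ["ppm", "ppm"]})

# .dtype.kind is a one-character code: 'b' bool, 'i' int, 'f' float,
# 'c' complex, 'O' object (e.g., strings)
print(df["Average"].dtype.kind)  # 'f', so treated as numeric
print(df["Unit"].dtype.kind)     # 'O', so treated as categorical
```

Any column whose kind falls outside `"bifc"` is routed to the bar-plot branch.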

::::::::::::::::::::::::::::::::::::: callout

### Pseudo-randomness 🔍

You may obtain slightly different results due to the pseudo-randomness of the Command mode generation process.

:::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: callout

### Instructions 🔍

The instructions provided in the text were clear and precise, designed to achieve the expected results accurately using the command mode. Try experimenting with removing, rearranging, or adding details to the instructions. You’ll notice that the assistant might generate slightly different code, which occasionally may not fully meet your intended goal.

This exercise highlights the importance of having a clear understanding of what you want to achieve when seeking help from an assistant. It allows you to refine or adjust the instructions to guide the tool effectively toward your objective. Relying too heavily on the assistant can lead to mistakes, a point we will emphasize repeatedly throughout this lesson.

:::::::::::::::::::::::::::::::::::::

### Docstrings Generation

Now, let's modify the docstrings of the `get_descriptive_stats()` and `plot_column_distribution()` functions you created during the previous exercise to add further details using Codeium's `Refactor` lens. Each docstring should:

- Describe the purpose of the function
- Document the function’s arguments and expected data types
- Explain what the function returns (if applicable)
- Optionally, provide a usage example

To do this, click on the `Refactor` lens above the function definition and select the `Add docstring and comments to the code` option. Codeium will add more details to the existing docstring, making it more informative and useful.

Note that if you don't have a docstring yet in your function definition, another lens will appear to help you generate one, the `Generate Docstring` lens. Try experimenting with both lenses to see how they can improve your code documentation.

::::::::::::::::::::::::::::::::::::: callout

### 💡 Tip

Try experimenting with different docstring styles! For example, you could also explore the [Google-style docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) using the `Refactor` lens or the Command mode. The default style used by the `Docstring` lens should be the [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html).
Try experimenting with different docstring styles! For example, you could also explore the [Google-style docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) using the `Refactor` lens or the Command mode. The default style used by the lenses should be the [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html).

:::::::::::::::::::::::::::::::::::::
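For comparison, here is a sketch of the same signature documented in each style (hypothetical function names, not Codeium output):

```python
def plot_distribution_numpy_style(df, column):
    """Plot the distribution of a column.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing the data.
    column : str
        The column to plot.
    """


def plot_distribution_google_style(df, column):
    """Plot the distribution of a column.

    Args:
        df (pandas.DataFrame): The DataFrame containing the data.
        column (str): The column to plot.
    """
```

Both styles are understood by common documentation tools; choose one and apply it consistently across a project.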

@@ -333,62 +373,60 @@ While Command mode is not aware of the context of your code and doesn't maintain

:::::::::::::::::::::::::::::::::::::

Please note that, while you could manually write the docstring and use suggestions from Autocomplete mode (which we will cover later in this episode), this task is designed to demonstrate Codeium's `Docstring` functionality.

Here’s an example of how the `plot_distribution()` and `plot_column_distribution()` functions might look with the refactored docstrings:

```python
def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    For numerical columns, a histogram is plotted. For categorical columns,
    a bar plot of the counts is plotted.

    Parameters
    ----------
    df : DataFrame
        The DataFrame to plot the distribution for.
    column : str
        The column to plot the distribution for.

    Returns
    -------
    None
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numerical columns
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar plot of the counts for categorical columns
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)


def plot_column_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    Parameters
    ----------
    df : DataFrame
        The DataFrame containing the data.
    column : str
        The column name in the DataFrame for which to plot the distribution.
    """
    # Create a new figure and axis for the plot
    fig, ax = plt.subplots()

    # Check if the column is of a numeric type
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numeric data
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar chart for categorical data
        df[column].value_counts().plot.bar(ax=ax)

    # Set the title of the plot to the column name
    ax.set_title(column)
```

Note that you might need to adjust the generated docstring if the function has complex logic or if the generated docstring lacks specific details about edge cases or exceptions.
@@ -397,7 +435,7 @@

## Bug Fixing (5 min)

Look back at the code generated during the "Code Generation" section. If you look at the head of the DataFrame, what do you notice? Use the Chat feature to discuss the issue with Codeium and ask for suggestions on how to resolve it. Then run the functions defined in the previous exercise again to see if the issue has been resolved.

::::::::::::::::::::::::::::::::::::::::::::::::

Expand All @@ -407,7 +445,7 @@ Look back at the code generated during the "Assisted Code Generation" section. I

The issue is that the `Date` column is used as the index column, causing all the other columns to shift by one. Here’s how you might discuss the issue with Codeium in the Chat:

1. **Prompt**: "The `Date` column is being used as the index, causing the other columns to shift by one. How can I read the file without running into this issue?"
2. **Discussion**: Codeium might suggest resetting the index or using the `reset_index()` function to address the issue. Alternatively, it might recommend setting `index_col=False` when reading the CSV file to prevent the `Date` column from being used as the index.
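
A minimal sketch of the `index_col=False` approach on a tiny stand-in dataset (hypothetical sample rows, not the real `co2-mm-mlo.csv` file):

```python
import io
import pandas as pd

# A tiny stand-in for the CO2 CSV file (invented sample rows)
csv_data = io.StringIO(
    "Date,Average,Interpolated\n"
    "2000-01,369.25,369.25\n"
    "2000-02,369.50,369.50\n"
)

# index_col=False prevents pandas from treating the first column as the
# index, so 'Date' stays a regular column and the others no longer shift.
df = pd.read_csv(csv_data, index_col=False)
print(df.columns.tolist())  # ['Date', 'Average', 'Interpolated']
```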

Correct example of how to resolve the issue:
@@ -444,7 +482,7 @@ Or even like this:

```python
df['Avg-Int'] = df['Average'] - df['Interpolated']
```

This version is faster and more memory-efficient because it uses vectorized operations, which are a key feature of the `pandas` library.
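To make the difference concrete, here is a small sketch (toy values, not the real dataset) comparing a row-by-row loop with the vectorized subtraction; both produce the same values, but the vectorized form delegates the work to optimized code inside `pandas`:

```python
import pandas as pd

df = pd.DataFrame({"Average": [316.91, 317.64, 317.38],
                   "Interpolated": [316.91, 317.64, 317.38]})

# Loop version (slow): Python-level iteration over every row
loop_result = [row["Average"] - row["Interpolated"] for _, row in df.iterrows()]

# Vectorized version (fast): one operation on whole columns at once
df["Avg-Int"] = df["Average"] - df["Interpolated"]

print(df["Avg-Int"].tolist())  # [0.0, 0.0, 0.0]
```

On a few rows the difference is invisible; on millions of rows the loop version can be orders of magnitude slower.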

::::::::::::::::::::::::::::::::::::: challenge

Expand All @@ -454,16 +492,16 @@ Similar to the exercise above, execute the code as is to verify it works and exa

```python
# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')

# Filter data for a specific date range
filtered_df = df[(df['Date'] >= '2000-01-01') & (df['Date'] <= '2010-12-31')]

# Extract the year value from the 'Date' column
filtered_df['Year'] = filtered_df['Date'].dt.year

# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby('Year')['Interpolated'].mean()

# Plot the results
plt.figure(figsize=(10, 6))
@@ -485,12 +523,13 @@
plt.show()
```

```python
# Convert 'Date' column to datetime format and filter data for a specific date range
filtered_df = df[
    (pd.to_datetime(df['Date'], format='%Y-%m') >= '2000-01-01') &
    (pd.to_datetime(df['Date'], format='%Y-%m') <= '2010-12-31')]

# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby(pd.to_datetime(filtered_df['Date'], format='%Y-%m').dt.year)['Interpolated'].mean()

# Plot the results
@@ -507,7 +546,7 @@
```

- Combined the `pd.to_datetime` conversion and filtering steps into one.

- Removed the unnecessary `filtered_df['Year']` column and used the `dt.year` accessor to extract the year from the `'Date'` column.

- Simplified the plotting code by using the `plot` method of the Series object and removing the unnecessary `plt.figure` call.
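As an illustration of the `dt.year` plus `groupby` pattern, here is a small sketch on toy data (invented values, not the real CO2 records):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2000-01", "2000-02", "2001-01"],
    "Interpolated": [369.25, 369.50, 370.50],
})

# Group by the year extracted on the fly, without adding a 'Year' column
avg_per_year = df.groupby(
    pd.to_datetime(df["Date"], format="%Y-%m").dt.year
)["Interpolated"].mean()

print(avg_per_year.to_dict())  # {2000: 369.375, 2001: 370.5}
```

Grouping by a derived Series like this avoids mutating the filtered DataFrame, which also sidesteps pandas' `SettingWithCopyWarning`.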

2 changes: 1 addition & 1 deletion md5sum.txt
@@ -5,7 +5,7 @@
"index.md" "34399a5e151cea103c92aa7767f4c118" "site/built/index.md" "2024-09-13"
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-04-22"
"episodes/1-introduction-ai-coding.md" "94f89961042798b06550db4272371638" "site/built/1-introduction-ai-coding.md" "2024-10-28"
"episodes/2-code-generation-optimization.md" "b4bdd36643ceff7c94d63b212ae55a94" "site/built/2-code-generation-optimization.md" "2024-11-21"
"episodes/2-code-generation-optimization.md" "b116743349e7c23ad5ac15a720869ad5" "site/built/2-code-generation-optimization.md" "2024-11-22"
"episodes/3-ethical-and-security-considerations.md" "5540ef40fd86510c8eba8273204334c5" "site/built/3-ethical-and-security-considerations.md" "2024-10-28"
"instructors/instructor-notes.md" "d96cb9e76302dee237d0897fb5c1b1a7" "site/built/instructor-notes.md" "2024-09-13"
"learners/reference.md" "86cca05410972bf6feb3e65095c6c89b" "site/built/reference.md" "2024-10-23"
