Commit 75bf912: differences for PR #32

actions-user committed Nov 22, 2024
1 parent a7909ce commit 75bf912

Showing 2 changed files with 139 additions and 100 deletions.

237 changes: 138 additions & 99 deletions 2-code-generation-optimization.md

@@ -233,95 +233,135 @@ The following shortcuts can be used to speed up your workflow:

## Hands-on Practice

In the following exercises, you will have the opportunity to practice using Codeium's Command, Chat, and Autocomplete features to generate, optimize, and refactor code. Create a Python file (for example `exercise.py`) in your IDE and follow along with the exercises.

::::::::::::::::::::::::::::::::::::: callout

### Jupyter Notebooks (Not Recommended)

It is also possible to use Codeium in Jupyter Notebooks, but for the best experience, it is recommended to use Jupyter Lab after installing the officially provided [Codeium extension for JupyterLab](https://codeium.com/jupyter_tutorial).

While it is possible to use Codeium in Jupyter Notebooks directly within VS Code, the experience may not be as smooth as in standard Python files. In particular, Windows users may encounter issues with some of the Codeium shortcuts and features.

:::::::::::::::::::::::::::::::::::::

### Code Generation

Let's start by exploring the Command mode and generating code snippets to analyze a dataset. With the Python file open, press `⌘(Command)+I` on Mac or `Ctrl+I` on Windows/Linux to open the Command prompt. Then, copy and paste the following text into it (you can also break it down into smaller pieces if you prefer):

```output
Load a [CO2 concentration dataset](https://datahub.io/core/co2-ppm/) from the file `co2-mm-mlo.csv` into a Pandas DataFrame, then generate descriptive statistics and visualize data distributions. Read the dataset using the following URL: https://edu.nl/k6v7x.
1. Write a function that takes a DataFrame as input and calculates key descriptive statistics, including:
- Number of rows and columns
- Data types of each column
- Summary statistics (e.g., mean, minimum, maximum)
Compute the statistics only for the numeric columns.
2. Write a function that accepts a DataFrame and a specific column as inputs. If the column is numeric (e.g., `int64`, `float64`), create a histogram to display its distribution; if categorical, create a bar plot to show category frequencies.
3. Write a function to plot the `Average` and `Interpolated` columns on a single graph, with `Date` on the x-axis, to visualize their distributions over time.
4. In the main, print nicely the information computed in 1., run the function defined in 2. on all columns, and run the function defined in 3. Use the `show()` functionality to display the figures only at the end of the main.
```

Here is what you would expect to see in the generated code:

```python
import matplotlib.pyplot as plt
import pandas as pd


def get_descriptive_stats(df):
    """
    Calculate key descriptive statistics for a given DataFrame.
    """
    stats = {"nrow": df.shape[0], "ncol": df.shape[1]}
    for col in df.select_dtypes(include="number").columns:
        stats[col] = {
            "dtype": df[col].dtype,
            "mean": df[col].mean(),
            "min": df[col].min(),
            "max": df[col].max(),
        }
    return stats


def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        df[column].plot.hist(ax=ax, bins=50)
    else:
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)


def plot_time_series(df):
    """
    Plot the Average and Interpolated columns over time.
    """
    fig, ax = plt.subplots()
    df.plot(x="Date", y=["Average", "Interpolated"], ax=ax)


def main():
    url = "https://edu.nl/k6v7x"
    df = pd.read_csv(url)
    stats = get_descriptive_stats(df)
    print(pd.DataFrame(stats).T)
    for col in df.columns:
        plot_distribution(df, col)
    plot_time_series(df)
    plt.show()


if __name__ == "__main__":
    main()
```

There is something wrong here, can you spot it? We will address this issue later in the "Bug Fixing" exercise, so keep it in mind as you proceed.
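As an aside, the `dtype.kind in "bifc"` test in the generated code relies on NumPy's single-character type codes: `b` (boolean), `i` (signed integer), `f` (floating point), and `c` (complex). A small sketch with toy data (invented values, purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Average": [316.91, 317.64], "Unit": ["ppm", "ppm"]})

# .dtype.kind is a one-character code: 'b' bool, 'i' int, 'f' float,
# 'c' complex, 'O' object (e.g., strings)
print(df["Average"].dtype.kind)  # 'f', so treated as numeric
print(df["Unit"].dtype.kind)     # 'O', so treated as categorical
```

Any column whose kind falls outside `"bifc"` is routed to the bar-plot branch.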

::::::::::::::::::::::::::::::::::::: callout

### Pseudo-randomness 🔍

You may obtain slightly different results due to the pseudo-randomness of the Command mode generation process.

:::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: callout

### Instructions 🔍

The instructions provided in the text were clear and precise, designed to achieve the expected results accurately using the command mode. Try experimenting with removing, rearranging, or adding details to the instructions. You’ll notice that the assistant might generate slightly different code, which occasionally may not fully meet your intended goal.

This exercise highlights the importance of having a clear understanding of what you want to achieve when seeking help from an assistant. It allows you to refine or adjust the instructions to guide the tool effectively toward your objective. Relying too heavily on the assistant can lead to mistakes, a point we will emphasize repeatedly throughout this lesson.

:::::::::::::::::::::::::::::::::::::

### Docstrings Generation

Now, let's modify the docstrings of the `get_descriptive_stats()` and `plot_column_distribution()` functions you created during the previous exercise to add further details using Codeium's `Refactor` lens. Each docstring should:

- Describe the purpose of the function
- Document the function’s arguments and expected data types
- Explain what the function returns (if applicable)
- Optionally, provide a usage example

To do this, click on the `Refactor` lens above the function definition and select the `Add docstring and comments to the code` option. Codeium will add more details to the existing docstring, making it more informative and useful.

Note that if you don't have a docstring yet in your function definition, another lens will appear to help you generate one, the `Generate Docstring` lens. Try experimenting with both lenses to see how they can improve your code documentation.

::::::::::::::::::::::::::::::::::::: callout

### 💡 Tip

Try experimenting with different docstring styles! For example, you could also explore the [Google-style docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) using the `Refactor` lens or the Command mode. The default style used by the `Docstring` lens should be the [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html).
Try experimenting with different docstring styles! For example, you could also explore the [Google-style docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) using the `Refactor` lens or the Command mode. The default style used by the lenses should be the [NumPy-style](https://numpydoc.readthedocs.io/en/latest/format.html).

:::::::::::::::::::::::::::::::::::::
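For comparison, here is a sketch of the same signature documented in each style (hypothetical function names, not Codeium output):

```python
def plot_distribution_numpy_style(df, column):
    """Plot the distribution of a column.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame containing the data.
    column : str
        The column to plot.
    """


def plot_distribution_google_style(df, column):
    """Plot the distribution of a column.

    Args:
        df (pandas.DataFrame): The DataFrame containing the data.
        column (str): The column to plot.
    """
```

Both styles are understood by common documentation tools; choose one and apply it consistently across a project.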

@@ -333,62 +373,60 @@ While Command mode is not aware of the context of your code and doesn't maintain

:::::::::::::::::::::::::::::::::::::

Please note that, while you could manually write the docstring and use suggestions from Autocomplete mode (which we will cover later in this episode), this task is designed to demonstrate Codeium's `Docstring` functionality.

Here’s an example of how the `plot_distribution()` and `plot_column_distribution()` functions might look with the refactored docstrings:

```python
def plot_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    For numerical columns, a histogram is plotted. For categorical columns,
    a bar plot of the counts is plotted.

    Parameters
    ----------
    df : DataFrame
        The DataFrame to plot the distribution for.
    column : str
        The column to plot the distribution for.

    Returns
    -------
    None
    """
    fig, ax = plt.subplots()
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numerical columns
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar plot of the counts for categorical columns
        df[column].value_counts().plot.bar(ax=ax)
    ax.set_title(column)


def plot_column_distribution(df, column):
    """
    Plot the distribution of a given column in a DataFrame.

    Parameters
    ----------
    df : DataFrame
        The DataFrame containing the data.
    column : str
        The column name in the DataFrame for which to plot the distribution.
    """
    # Create a new figure and axis for the plot
    fig, ax = plt.subplots()

    # Check if the column is of a numeric type
    if df[column].dtype.kind in "bifc":
        # Plot a histogram for numeric data
        df[column].plot.hist(ax=ax, bins=50)
    else:
        # Plot a bar chart for categorical data
        df[column].value_counts().plot.bar(ax=ax)

    # Set the title of the plot to the column name
    ax.set_title(column)
```

Note that you might need to adjust the generated docstring if the function has complex logic or if the generated docstring lacks specific details about edge cases or exceptions.
@@ -397,7 +435,7 @@

## Bug Fixing (5 min)

Look back at the code generated during the "Code Generation" section. If you look at the head of the DataFrame, what do you notice? Use the Chat feature to discuss the issue with Codeium and ask for suggestions on how to resolve it. Then run the functions defined in the previous exercise again to see if the issue has been resolved.

::::::::::::::::::::::::::::::::::::::::::::::::

Expand All @@ -407,7 +445,7 @@ Look back at the code generated during the "Assisted Code Generation" section. I

The issue is that the `Date` column is used as the index column, causing all the other columns to shift by one. Here’s how you might discuss the issue with Codeium in the Chat:

1. **Prompt**: "The `Date` column is being used as the index, causing the other columns to shift by one. How can I read the file without running into this issue?"
2. **Discussion**: Codeium might suggest resetting the index or using the `reset_index()` function to address the issue. Alternatively, it might recommend setting `index_col=False` when reading the CSV file to prevent the `Date` column from being used as the index.
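
A minimal sketch of the `index_col=False` approach on a tiny stand-in dataset (hypothetical sample rows, not the real `co2-mm-mlo.csv` file):

```python
import io
import pandas as pd

# A tiny stand-in for the CO2 CSV file (invented sample rows)
csv_data = io.StringIO(
    "Date,Average,Interpolated\n"
    "2000-01,369.25,369.25\n"
    "2000-02,369.50,369.50\n"
)

# index_col=False prevents pandas from treating the first column as the
# index, so 'Date' stays a regular column and the others no longer shift.
df = pd.read_csv(csv_data, index_col=False)
print(df.columns.tolist())  # ['Date', 'Average', 'Interpolated']
```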

Correct example of how to resolve the issue:
@@ -444,7 +482,7 @@ Or even like this:

```python
df['Avg-Int'] = df['Average'] - df['Interpolated']
```

This version is faster and more memory-efficient because it uses vectorized operations, which are a key feature of the `pandas` library.
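To make the difference concrete, here is a small sketch (toy values, not the real dataset) comparing a row-by-row loop with the vectorized subtraction; both produce the same values, but the vectorized form delegates the work to optimized code inside `pandas`:

```python
import pandas as pd

df = pd.DataFrame({"Average": [316.91, 317.64, 317.38],
                   "Interpolated": [316.91, 317.64, 317.38]})

# Loop version (slow): Python-level iteration over every row
loop_result = [row["Average"] - row["Interpolated"] for _, row in df.iterrows()]

# Vectorized version (fast): one operation on whole columns at once
df["Avg-Int"] = df["Average"] - df["Interpolated"]

print(df["Avg-Int"].tolist())  # [0.0, 0.0, 0.0]
```

On a few rows the difference is invisible; on millions of rows the loop version can be orders of magnitude slower.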

::::::::::::::::::::::::::::::::::::: challenge

Expand All @@ -454,16 +492,16 @@ Similar to the exercise above, execute the code as is to verify it works and exa

```python
# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')

# Filter data for a specific date range
filtered_df = df[(df['Date'] >= '2000-01-01') & (df['Date'] <= '2010-12-31')]

# Extract the year value from the 'Date' column
filtered_df['Year'] = filtered_df['Date'].dt.year

# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby('Year')['Interpolated'].mean()

# Plot the results
plt.figure(figsize=(10, 6))
@@ -485,12 +523,13 @@
plt.show()
```

```python
# Convert 'Date' column to datetime format and filter data for a specific date range
filtered_df = df[
    (pd.to_datetime(df['Date'], format='%Y-%m') >= '2000-01-01') &
    (pd.to_datetime(df['Date'], format='%Y-%m') <= '2010-12-31')]

# Group data by year and calculate the average CO2 level for each year
avg_co2_per_year = filtered_df.groupby(pd.to_datetime(filtered_df['Date'], format='%Y-%m').dt.year)['Interpolated'].mean()

# Plot the results
@@ -507,7 +546,7 @@
```

- Combined the `pd.to_datetime` conversion and filtering steps into one.

- Removed the unnecessary `filtered_df['Year']` column and used the `dt.year` accessor to extract the year from the `'Date'` column.

- Simplified the plotting code by using the `plot` method of the Series object and removing the unnecessary `plt.figure` call.
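As an illustration of the `dt.year` plus `groupby` pattern, here is a small sketch on toy data (invented values, not the real CO2 records):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2000-01", "2000-02", "2001-01"],
    "Interpolated": [369.25, 369.50, 370.50],
})

# Group by the year extracted on the fly, without adding a 'Year' column
avg_per_year = df.groupby(
    pd.to_datetime(df["Date"], format="%Y-%m").dt.year
)["Interpolated"].mean()

print(avg_per_year.to_dict())  # {2000: 369.375, 2001: 370.5}
```

Grouping by a derived Series like this avoids mutating the filtered DataFrame, which also sidesteps pandas' `SettingWithCopyWarning`.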

2 changes: 1 addition & 1 deletion md5sum.txt
@@ -5,7 +5,7 @@
"index.md" "34399a5e151cea103c92aa7767f4c118" "site/built/index.md" "2024-09-13"
"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2022-04-22"
"episodes/1-introduction-ai-coding.md" "94f89961042798b06550db4272371638" "site/built/1-introduction-ai-coding.md" "2024-10-28"
"episodes/2-code-generation-optimization.md" "b4bdd36643ceff7c94d63b212ae55a94" "site/built/2-code-generation-optimization.md" "2024-11-21"
"episodes/2-code-generation-optimization.md" "b116743349e7c23ad5ac15a720869ad5" "site/built/2-code-generation-optimization.md" "2024-11-22"
"episodes/3-ethical-and-security-considerations.md" "5540ef40fd86510c8eba8273204334c5" "site/built/3-ethical-and-security-considerations.md" "2024-10-28"
"instructors/instructor-notes.md" "d96cb9e76302dee237d0897fb5c1b1a7" "site/built/instructor-notes.md" "2024-09-13"
"learners/reference.md" "86cca05410972bf6feb3e65095c6c89b" "site/built/reference.md" "2024-10-23"
