[SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tab in Spark documentation

### What changes were proposed in this pull request?

Currently, there is no migration section for PySpark, Spark Core and Structured Streaming.
It is difficult for users to know what to do when they upgrade.

This PR proposes to create a "Migration Guide" tab in the Spark documentation.

![Screen Shot 2019-09-11 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/64688126-ad712f80-d4c6-11e9-8672-9a2c56c05bf8.png)

![Screen Shot 2019-09-11 at 7 27 15 PM](https://user-images.githubusercontent.com/6477701/64689915-389ff480-d4ca-11e9-8c54-7f46095d0d23.png)

This page will contain the migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Spark Core. It is basically a refactoring of existing documentation.

Some new information has been added as well; I will leave inline comments for easier review.

1. **MLlib**
  Merge [ml-guide.html#migration-guide](https://spark.apache.org/docs/latest/ml-guide.html#migration-guide) and [ml-migration-guides.html](https://spark.apache.org/docs/latest/ml-migration-guides.html)

    ```
    'docs/ml-guide.md'
            ↓ Merge new/old migration guides
    'docs/ml-migration-guide.md'
    ```

2. **PySpark**
  Extract PySpark-specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html

    ```
    'docs/sql-migration-guide-upgrade.md'
           ↓ Extract PySpark specific items
    'docs/pyspark-migration-guide.md'
    ```

3. **SparkR**
  Move [sparkr.html#migration-guide](https://spark.apache.org/docs/latest/sparkr.html#migration-guide) into a separate file, and extract SparkR-specific items from [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html)

    ```
    'docs/sparkr.md'                     'docs/sql-migration-guide-upgrade.md'
     Move migration guide section ↘     ↙ Extract SparkR specific items
                     docs/sparkr-migration-guide.md
    ```

4. **Core**
  Newly created at `'docs/core-migration-guide.md'`. I skimmed the JIRAs resolved for 3.0.0 and found some items to note.

5. **Structured Streaming**
  Newly created at `'docs/ss-migration-guide.md'`. I skimmed the JIRAs resolved for 3.0.0 and found some items to note.

6. **SQL**
  Merge [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html) and [sql-migration-guide-hive-compatibility.html](https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html)
    ```
    'docs/sql-migration-guide-hive-compatibility.md'     'docs/sql-migration-guide-upgrade.md'
     Move Hive compatibility section ↘                   ↙ Left over after filtering PySpark and SparkR items
                                  'docs/sql-migration-guide.md'
    ```

### Why are the changes needed?

To help users in production migrate to higher versions effectively, and to detect behaviour changes or breaking changes before upgrading and/or migrating.

### Does this PR introduce any user-facing change?

Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html.

### How was this patch tested?

Manually built the docs. This can be verified as follows:

```bash
cd docs
SKIP_API=1 jekyll build
open _site/index.html
```

Closes apache#25757 from HyukjinKwon/migration-doc.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
HyukjinKwon authored and dongjoon-hyun committed Sep 15, 2019
1 parent b91648c commit 7d4eb38
Showing 17 changed files with 1,295 additions and 1,157 deletions.
12 changes: 12 additions & 0 deletions docs/_data/menu-migration.yaml
@@ -0,0 +1,12 @@
- text: Spark Core
url: core-migration-guide.html
- text: SQL, Datasets and DataFrame
url: sql-migration-guide.html
- text: Structured Streaming
url: ss-migration-guide.html
- text: MLlib (Machine Learning)
url: ml-migration-guide.html
- text: PySpark (Python on Spark)
url: pyspark-migration-guide.html
- text: SparkR (R on Spark)
url: sparkr-migration-guide.html
10 changes: 1 addition & 9 deletions docs/_data/menu-sql.yaml
@@ -64,15 +64,7 @@
- text: Usage Notes
url: sql-pyspark-pandas-with-arrow.html#usage-notes
- text: Migration Guide
url: sql-migration-guide.html
subitems:
- text: Spark SQL Upgrading Guide
url: sql-migration-guide-upgrade.html
- text: Compatibility with Apache Hive
url: sql-migration-guide-hive-compatibility.html
- text: SQL Reserved/Non-Reserved Keywords
url: sql-reserved-and-non-reserved-keywords.html

url: sql-migration-old.html
- text: SQL Reference
url: sql-ref.html
subitems:
6 changes: 6 additions & 0 deletions docs/_includes/nav-left-wrapper-migration.html
@@ -0,0 +1,6 @@
<div class="left-menu-wrapper">
<div class="left-menu">
<h3><a href="migration-guide.html">Migration Guide</a></h3>
{% include nav-left.html nav=include.nav-migration %}
</div>
</div>
7 changes: 5 additions & 2 deletions docs/_layouts/global.html
@@ -112,6 +112,7 @@
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="migration-guide.html">Migration Guide</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://spark.apache.org/contributing.html">Contributing to Spark</a></li>
@@ -126,8 +127,10 @@

<div class="container-wrapper">

{% if page.url contains "/ml" or page.url contains "/sql" %}
{% if page.url contains "/ml" %}
{% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "migration-guide.html" %}
{% if page.url contains "migration-guide.html" %}
{% include nav-left-wrapper-migration.html nav-migration=site.data.menu-migration %}
{% elsif page.url contains "/ml" %}
{% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %}
{% else %}
{% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %}
32 changes: 32 additions & 0 deletions docs/core-migration-guide.md
@@ -0,0 +1,32 @@
---
layout: global
title: "Migration Guide: Spark Core"
displayTitle: "Migration Guide: Spark Core"
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---

* Table of contents
{:toc}

## Upgrading from Core 2.4 to 3.0

- In Spark 3.0, the deprecated method `TaskContext.isRunningLocally` has been removed. Local execution was removed and the method has always returned `false`.

- In Spark 3.0, the deprecated methods `shuffleBytesWritten`, `shuffleWriteTime` and `shuffleRecordsWritten` in `ShuffleWriteMetrics` have been removed. Use `bytesWritten`, `writeTime` and `recordsWritten` instead, respectively.

- In Spark 3.0, the deprecated method `AccumulableInfo.apply` has been removed because creating `AccumulableInfo` is disallowed.
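
To illustrate the `ShuffleWriteMetrics` rename, here is a minimal Scala sketch of a `SparkListener` reading the new accessor names; the listener class name and the log format are illustrative only, not part of this guide.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Reads shuffle write metrics via the accessors that replace the removed
// shuffleBytesWritten / shuffleWriteTime / shuffleRecordsWritten methods.
class ShuffleWriteLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    Option(taskEnd.taskMetrics).map(_.shuffleWriteMetrics).foreach { m =>
      println(s"bytesWritten=${m.bytesWritten} writeTime=${m.writeTime} " +
        s"recordsWritten=${m.recordsWritten}")
    }
  }
}

// Register it with an existing SparkContext, e.g.:
// spark.sparkContext.addSparkListener(new ShuffleWriteLogger)
```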

1 change: 1 addition & 0 deletions docs/index.md
@@ -146,6 +146,7 @@ options for deployment:
* Integration with other storage systems:
* [Cloud Infrastructures](cloud-integration.html)
* [OpenStack Swift](storage-openstack-swift.html)
* [Migration Guide](migration-guide.html): Migration guides for Spark components
* [Building Spark](building-spark.html): build Spark using the Maven system
* [Contributing to Spark](https://spark.apache.org/contributing.html)
* [Third Party Projects](https://spark.apache.org/third-party-projects.html): related third party Spark projects
30 changes: 30 additions & 0 deletions docs/migration-guide.md
@@ -0,0 +1,30 @@
---
layout: global
title: Migration Guide
displayTitle: Migration Guide
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---

This page documents sections of the migration guide for each component in order
for users to migrate effectively.

* [Spark Core](core-migration-guide.html)
* [SQL, Datasets, and DataFrame](sql-migration-guide.html)
* [Structured Streaming](ss-migration-guide.html)
* [MLlib (Machine Learning)](ml-migration-guide.html)
* [PySpark (Python on Spark)](pyspark-migration-guide.html)
* [SparkR (R on Spark)](sparkr-migration-guide.html)
65 changes: 2 additions & 63 deletions docs/ml-guide.md
@@ -113,68 +113,7 @@ transforming multiple columns.
* Robust linear regression with Huber loss
([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).

# Migration guide
# Migration Guide

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.
The migration guide is now archived [on this page](ml-migration-guide.html).

## From 2.4 to 3.0

### Breaking changes

* `OneHotEncoder` which is deprecated in 2.3, is removed in 3.0 and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.

### Changes of behavior

* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
In Spark 2.4 and previous versions, when specifying `frequencyDesc` or `frequencyAsc` as
`stringOrderType` param in `StringIndexer`, in case of equal frequency, the order of
strings is undefined. Since Spark 3.0, the strings with equal frequency are further
sorted by alphabet. And since Spark 3.0, `StringIndexer` supports encoding multiple
columns.

## From 2.2 to 2.3

### Breaking changes

* The class and trait hierarchy for logistic regression model summaries was changed to be cleaner
and better accommodate the addition of the multi-class summary. This is a breaking change for user
code that casts a `LogisticRegressionTrainingSummary` to a
`BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary`
method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail
(_note_ this is an `Experimental` API). This _does not_ affect the Python `summary` method, which
will still work correctly for both multinomial and binary cases.

### Deprecations and changes of behavior

**Deprecations**

* `OneHotEncoder` has been deprecated and will be removed in `3.0`. It has been replaced by the
new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
(see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030)). **Note** that
`OneHotEncoderEstimator` will be renamed to `OneHotEncoder` in `3.0` (but
`OneHotEncoderEstimator` will be kept as an alias).

**Changes of behavior**

* [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In `2.2` and
earlier versions, the level of parallelism was set to the default threadpool size in Scala.
* [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156):
The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than
`1`. This will cause training results to be different between `2.3` and earlier versions.
* [SPARK-21681](https://issues.apache.org/jira/browse/SPARK-21681):
Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients
when some features had zero variance.
* [SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957):
Tree algorithms now use mid-points for split values. This may change results from model training.
* [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657):
Fixed an issue where the features generated by `RFormula` without an intercept were inconsistent
with the output in R. This may change results from model training in this scenario.

## Previous Spark versions

Earlier migration guides are archived [on this page](ml-migration-guides.html).

---
96 changes: 82 additions & 14 deletions docs/ml-migration-guides.md → docs/ml-migration-guide.md
@@ -1,8 +1,7 @@
---
layout: global
title: Old Migration Guides - MLlib
displayTitle: Old Migration Guides - MLlib
description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
title: "Migration Guide: MLlib (Machine Learning)"
displayTitle: "Migration Guide: MLlib (Machine Learning)"
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -20,15 +19,80 @@ license: |
limitations under the License.
---

The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
* Table of contents
{:toc}

## From 2.1 to 2.2
Note that this migration guide describes the items specific to MLlib.
Many items in the SQL migration guide also apply when migrating MLlib's DataFrame-based APIs to higher versions.
Please refer to [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.html).

## Upgrading from MLlib 2.4 to 3.0

### Breaking changes
{:.no_toc}

* `OneHotEncoder`, which was deprecated in 2.3, has been removed in 3.0, and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.

### Changes of behavior
{:.no_toc}

* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
  In Spark 2.4 and earlier versions, when `frequencyDesc` or `frequencyAsc` is specified as the
  `stringOrderType` param in `StringIndexer`, the order of strings with equal frequency is
  undefined. Since Spark 3.0, strings with equal frequency are further sorted alphabetically.
  Also since Spark 3.0, `StringIndexer` supports encoding multiple columns.
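
As a rough Scala sketch of the rename: the multi-column estimator formerly named `OneHotEncoderEstimator` is now simply `OneHotEncoder` (the column names below are illustrative).

```scala
import org.apache.spark.ml.feature.OneHotEncoder

// In 3.0 this class is an estimator (the old OneHotEncoderEstimator) and
// accepts multiple input/output columns; StringIndexer gained the same
// multi-column setters.
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))

// val model = encoder.fit(df)          // df: an illustrative DataFrame
// val encoded = model.transform(df)
```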

## Upgrading from MLlib 2.2 to 2.3

### Breaking changes
{:.no_toc}

* The class and trait hierarchy for logistic regression model summaries was changed to be cleaner
and better accommodate the addition of the multi-class summary. This is a breaking change for user
code that casts a `LogisticRegressionTrainingSummary` to a
`BinaryLogisticRegressionTrainingSummary`. Users should instead use the `model.binarySummary`
method. See [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) for more detail
(_note_ this is an `Experimental` API). This _does not_ affect the Python `summary` method, which
will still work correctly for both multinomial and binary cases.
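
A minimal Scala sketch of the recommended `model.binarySummary` access; the tiny training set is illustrative, and `spark` is assumed to be an existing `SparkSession`.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Tiny illustrative binary dataset.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.1, 1.2)),
  (1.0, Vectors.dense(1.9, 0.8))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// Use binarySummary instead of casting model.summary; it throws if the
// model is not a binary classifier.
val binarySummary = model.binarySummary
println(s"areaUnderROC = ${binarySummary.areaUnderROC}")
```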

### Deprecations and changes of behavior
{:.no_toc}

**Deprecations**

* `OneHotEncoder` has been deprecated and will be removed in `3.0`. It has been replaced by the
new [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator)
(see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030)). **Note** that
`OneHotEncoderEstimator` will be renamed to `OneHotEncoder` in `3.0` (but
`OneHotEncoderEstimator` will be kept as an alias).

**Changes of behavior**

* [SPARK-21027](https://issues.apache.org/jira/browse/SPARK-21027):
The default parallelism used in `OneVsRest` is now set to 1 (i.e. serial). In `2.2` and
earlier versions, the level of parallelism was set to the default threadpool size in Scala.
* [SPARK-22156](https://issues.apache.org/jira/browse/SPARK-22156):
The learning rate update for `Word2Vec` was incorrect when `numIterations` was set greater than
`1`. This will cause training results to be different between `2.3` and earlier versions.
* [SPARK-21681](https://issues.apache.org/jira/browse/SPARK-21681):
Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients
when some features had zero variance.
* [SPARK-16957](https://issues.apache.org/jira/browse/SPARK-16957):
Tree algorithms now use mid-points for split values. This may change results from model training.
* [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657):
Fixed an issue where the features generated by `RFormula` without an intercept were inconsistent
with the output in R. This may change results from model training in this scenario.
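
For the `OneVsRest` parallelism change, a brief Scala sketch that opts back into parallel training explicitly; the base classifier and the value `4` are illustrative.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Since 2.3 the default parallelism is 1 (serial); set it explicitly to
// train the per-class models in parallel.
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression().setMaxIter(10))
  .setParallelism(4)

// val ovrModel = ovr.fit(training)     // training: an illustrative DataFrame
```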

## Upgrading from MLlib 2.1 to 2.2

### Breaking changes
{:.no_toc}

There are no breaking changes.

### Deprecations and changes of behavior
{:.no_toc}

**Deprecations**

@@ -45,9 +109,10 @@ There are no deprecations.
`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
would always be thrown regardless of the setting of the `handleInvalid` parameter.
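
A short Scala sketch of `StringIndexer` with `handleInvalid` set explicitly, which now also governs how `NULL` values are treated; the column names are illustrative.

```scala
import org.apache.spark.ml.feature.StringIndexer

// "keep" maps NULL and unseen labels to an extra index, "skip" drops those
// rows, and "error" (the default) throws.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")
```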

## From 2.0 to 2.1
## Upgrading from MLlib 2.0 to 2.1

### Breaking changes
{:.no_toc}

**Deprecated methods removed**

@@ -59,6 +124,7 @@ There are no deprecations.
* `validateParams` in `Evaluator`

### Deprecations and changes of behavior
{:.no_toc}

**Deprecations**

@@ -74,9 +140,10 @@ There are no deprecations.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

## From 1.6 to 2.0
## Upgrading from MLlib 1.6 to 2.0

### Breaking changes
{:.no_toc}

There were several breaking changes in Spark 2.0, which are outlined below.

@@ -171,6 +238,7 @@ Several deprecated methods were removed in the `spark.mllib` and `spark.ml` pack
A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).

### Deprecations and changes of behavior
{:.no_toc}

**Deprecations**

Expand Down Expand Up @@ -221,7 +289,7 @@ Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
`QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
The output buckets will differ for same input data and params.
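
A self-contained Scala sketch of the `approxQuantile` primitive that `QuantileDiscretizer` now relies on for its splits; the dataset, column name and relative error are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("approx-quantile-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny illustrative dataset; approxQuantile(col, probabilities, relativeError)
// returns the approximate quantiles used to derive bucket splits.
val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 100.0).toDF("value")
val splits = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.001)
println(splits.mkString(", "))

spark.stop()
```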

## From 1.5 to 1.6
## Upgrading from MLlib 1.5 to 1.6

There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
deprecations and changes of behavior.
@@ -248,7 +316,7 @@ Changes of behavior:
tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
behavior of the simpler `Tokenizer` transformer.

## From 1.4 to 1.5
## Upgrading from MLlib 1.4 to 1.5

In the `spark.mllib` package, there are no breaking API changes but several behavior changes:

@@ -267,7 +335,7 @@ In the `spark.ml` package, there exists one breaking API change and one behavior
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.

## From 1.3 to 1.4
## Upgrading from MLlib 1.3 to 1.4

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:

@@ -286,7 +354,7 @@ Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all
However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
changes for future releases.

## From 1.2 to 1.3
## Upgrading from MLlib 1.2 to 1.3

In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.

@@ -313,7 +381,7 @@ Other changes were in `LogisticRegression`:
* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.

## From 1.1 to 1.2
## Upgrading from MLlib 1.1 to 1.2

The only API changes in MLlib v1.2 are in
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
@@ -339,7 +407,7 @@ The tree `Node` now includes more information, including the probability of the
Examples in the Spark distribution and examples in the
[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.

## From 1.0 to 1.1
## Upgrading from MLlib 1.0 to 1.1

The only API changes in MLlib v1.1 are in
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
@@ -365,7 +433,7 @@ simple `String` types.
Examples of the new recommended `trainClassifier` and `trainRegressor` are given in the
[Decision Trees Guide](mllib-decision-tree.html#examples).

## From 0.9 to 1.0
## Upgrading from MLlib 0.9 to 1.0

In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
breaking changes. If your data is sparse, please store it in a sparse format instead of dense to