Fix broken links (#152) #202

Merged
2 changes: 1 addition & 1 deletion applications/async-web-server.ipynb
@@ -364,7 +364,7 @@
"source": [
"### Other options\n",
"\n",
"In these situations people today tend to use [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) or [Celery](http://www.celeryproject.org/).\n",
"In these situations people today tend to use [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) or [Celery](https://docs.celeryproject.org/en/latest/index.html).\n",
"\n",
"- concurrent.futures allows easy parallelism on a single machine and integrates well into async frameworks. The API is exactly what we showed above (Dask implements the concurrent.futures API). However concurrent.futures doesn't easily scale out to a cluster.\n",
"- Celery scales out more easily to multiple machines, but has higher latencies, doesn't scale down as nicely, and needs a bit of effort to integrate into async frameworks (or at least this is my understanding, my experience here is shallow)\n",
6 changes: 3 additions & 3 deletions applications/stencils-with-numba.ipynb
@@ -16,7 +16,7 @@
"In particular we show off two Numba features, and how they compose with Dask:\n",
"\n",
"1. Numba's [stencil decorator](https://numba.pydata.org/numba-doc/dev/user/stencil.html)\n",
"2. NumPy's [Generalized Universal Functions](https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html)\n",
"2. NumPy's [Generalized Universal Functions](https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html)\n",
"\n",
"*This was originally published as a blogpost [here](https://blog.dask.org/2019/04/09/numba-stencil)*"
]
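For reference, the stencil decorator the cell links to looks roughly like this (a 3x3 mean filter, as in the linked blogpost; the kernel here is a sketch):

```python
# A sketch of numba.stencil: the relative indices describe each
# element's neighborhood; out-of-bounds reads default to zero.
import numba
import numpy as np

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

@numba.njit  # compile the wrapper so the stencil runs at native speed
def smooth(x):
    return _smooth(x)

image = np.random.randint(0, 256, size=(128, 128))
result = smooth(image)
```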
@@ -168,7 +168,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And then because each of the chunks of a Dask array are just NumPy arrays, we can use the [map_blocks](https://docs.dask.org/en/latest/array-api.html#dask.array.core.map_blocks) function to apply this function across all of our images, and then save them out.\n",
"And then because each of the chunks of a Dask array are just NumPy arrays, we can use the [map_blocks](https://docs.dask.org/en/latest/generated/dask.array.map_blocks.html) function to apply this function across all of our images, and then save them out.\n",
"\n",
"This is fine, but lets go a bit further, and discuss generalized universal functions from NumPy."
]
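A hedged sketch of that map_blocks step (the array size and chunking are made up):

```python
# Apply a per-chunk NumPy function across a chunked Dask array.
import dask.array as da

x = da.random.random((4000, 4000), chunks=(1000, 1000))

def double(block):          # stand-in for the smoothing function above;
    return block * 2        # receives a plain NumPy array per chunk

y = x.map_blocks(double)    # lazy; keeps the same chunk structure as x
y.compute()
```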
@@ -200,7 +200,7 @@
"\n",
"**Numba Docs:** https://numba.pydata.org/numba-doc/dev/user/vectorize.html\n",
"\n",
"**NumPy Docs:** https://docs.scipy.org/doc/numpy-1.16.0/reference/c-api. generalized-ufuncs.html\n",
"**NumPy Docs:** https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html\n",
"\n",
"A generalized universal function (gufunc) is a normal function that has been\n",
"annotated with typing and dimension information. For example we can redefine\n",
2 changes: 1 addition & 1 deletion dataframes/02-groupby.ipynb
@@ -130,7 +130,7 @@
"source": [
"This is the same as with Pandas. Generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations, just more scalable.\n",
"\n",
"You can read more about Pandas' common aggregations in [the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation).\n",
"You can read more about Pandas' common aggregations in [the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#aggregation).\n",
"\n"
]
},
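A minimal sketch of the parallel: the same groupby-aggregation reads identically in both libraries (column names here are invented):

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

pdf.groupby('key').value.mean()            # Pandas: evaluates eagerly
ddf.groupby('key').value.mean().compute()  # Dask: lazy until compute()
```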
18 changes: 9 additions & 9 deletions dataframes/03-from-pandas-to-dask.ipynb
@@ -242,7 +242,7 @@
}
},
"source": [
"* Remember `Dask framework` is **lazy** thus in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) \n",
"* Remember `Dask framework` is **lazy** thus in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.compute.html) \n",
" (or `head()` which runs under the hood compute()) )"
]
},
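A quick illustration of that laziness (the toy dataframe is invented):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)

total = ddf.x.sum()      # lazy: builds a task graph, returns no data
print(total.compute())   # 45 -- compute() materializes the result
print(ddf.head())        # head() triggers compute() under the hood
```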
@@ -481,7 +481,7 @@
},
"source": [
"## Creating a `Dask dataframe` from `Pandas`\n",
"In order to utilize `Dask` capablities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. \n",
"In order to utilize `Dask` capablities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/generated/dask.dataframe.from_pandas.html) method. \n",
"You must supply the number of partitions or chunksize that will be used to generate the dask dataframe"
]
},
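A hedged sketch of the conversion (sizes are illustrative):

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'a': range(1000), 'b': range(1000)})

ddf = dd.from_pandas(pdf, npartitions=4)             # split by partition count
# alternatively: dd.from_pandas(pdf, chunksize=250)  # rows per partition
print(ddf.npartitions)                               # 4
```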
@@ -1211,7 +1211,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information see [dask mask documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.mask)"
"For more information see [dask mask documentation](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.mask.html)"
]
},
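A small sketch of `mask`, which follows the Pandas semantics of replacing values where the condition holds:

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4]}), npartitions=2)

masked = ddf.mask(ddf.x > 2, 0)  # where the condition is True, write 0
print(masked.compute())          # x becomes [1, 2, 0, 0]
```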
{
@@ -1597,7 +1597,7 @@
"metadata": {},
"source": [
"### Map partitions\n",
"* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method. \n",
"* We can supply an ad-hoc function to run on each partition using the [map_partitions](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.map_partitions.html) method. \n",
"This is mainly useful for functions that are not implemented in `Dask` or `Pandas`. \n",
"* Finally we can return a new `dataframe`, whose structure needs to be described in the `meta` argument. \n",
"The function can also accept additional arguments, as in the sketch below."
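A hedged sketch of map_partitions with an extra argument and a `meta` description (the function and column names are invented):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=2)

def add_offset(df, offset):
    # inside map_partitions, df is a plain Pandas dataframe
    return df.assign(y=df.x + offset)

# meta describes the structure of the dataframe the function returns
result = ddf.map_partitions(add_offset, offset=10,
                            meta={'x': 'int64', 'y': 'int64'})
result.compute()
```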
@@ -2274,7 +2274,7 @@
"ddf = dd.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=False). \n",
"* However some are not available such as `nrows`.\n",
"\n",
"[see documentaion](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming)."
"[see documentaion](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.to_csv.html) (including the option for output file naming)."
]
},
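A sketch of the output-naming option mentioned above (the path and naming scheme are invented):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': range(8)}), npartitions=2)

# '*' in the path is replaced per partition; name_function controls how
ddf.to_csv('output/part-*.csv', name_function=lambda i: f'{i:03d}')
```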
{
@@ -2415,9 +2415,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions) \n",
"To find the number of partitions which will determine the number of output files use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.npartitions.html) \n",
"\n",
"To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation."
"To change the number of output files use [repartition](https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.repartition.html) which is an expensive operation."
]
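A sketch of inspecting and changing the partition count:

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=10)
print(ddf.npartitions)                 # 10 -> would yield 10 output files

ddf2 = ddf.repartition(npartitions=2)  # expensive: moves data around
print(ddf2.npartitions)                # 2
```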
},
{
@@ -2527,7 +2527,7 @@
"source": [
" ## Consider using client.persist()\n",
" Since Dask is lazy - it may run the **entire** graph/DAG (again) even if it already run part of the calculation in a previous cell. Thus use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory \n",
"Additional information can be read in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an exampel in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes) \n",
"Additional information can be read in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes) \n",
"This concept should also be used when running a code within a script (rather then a jupyter notebook) which incoperates loops within the code.\n"
]
},
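A hedged sketch of the persist pattern, assuming a local distributed client:

```python
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd

client = Client()  # local cluster; in a script, keep a reference to it

ddf = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=4)
ddf = ddf.persist()      # start computing now, keep results in memory

ddf.x.sum().compute()    # reuses the persisted partitions
ddf.x.mean().compute()   # no recomputation of the shared graph
```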
@@ -2893,7 +2893,7 @@
"metadata": {},
"source": [
"We can do better... \n",
"Using [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is consideribly better"
"Using [dask custom aggregation](https://docs.dask.org/en/latest/generated/dask.dataframe.groupby.Aggregation.html) is consideribly better"
]
},
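A minimal sketch of a custom aggregation (re-implementing sum, just to show the chunk/agg structure):

```python
import dask.dataframe as dd
import pandas as pd

custom_sum = dd.Aggregation(
    'custom_sum',
    chunk=lambda s: s.sum(),  # applied to the groups within each partition
    agg=lambda s0: s0.sum(),  # combines the per-partition results
)

ddf = dd.from_pandas(
    pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]}), npartitions=2)
ddf.groupby('g').agg(custom_sum).compute()
```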
{
4 changes: 2 additions & 2 deletions dataframes/04-reading-messy-data-into-dataframes.ipynb
@@ -6,7 +6,7 @@
"source": [
"# DataFrames: Reading in messy data\n",
" \n",
"In the [01-data-access](./01-data-access.ipynb) example we show how Dask Dataframes can read and store data in many of the same formats as Pandas dataframes. One key difference, when using Dask Dataframes is that instead of opening a single file with a function like [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), we typically open many files at once with [dask.dataframe.read_csv](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv). This enables us to treat a collection of files as a single dataset. Most of the time this works really well. But real data is messy and in this notebook we will explore a more advanced technique to bring messy datasets into a dask dataframe."
"In the [01-data-access](./01-data-access.ipynb) example we show how Dask Dataframes can read and store data in many of the same formats as Pandas dataframes. One key difference, when using Dask Dataframes is that instead of opening a single file with a function like [pandas.read_csv](https://docs.dask.org/en/latest/generated/dask.dataframe.read_csv.html), we typically open many files at once with [dask.dataframe.read_csv](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv). This enables us to treat a collection of files as a single dataset. Most of the time this works really well. But real data is messy and in this notebook we will explore a more advanced technique to bring messy datasets into a dask dataframe."
]
},
{
@@ -408,7 +408,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we use [dask.dataframe.from_delayed](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_delayed). This function creates a Dask DataFrame from a list of delayed objects as long as each delayed object returns a pandas dataframe. The structure of each individual dataframe returned must also be the same."
"Then we use [dask.dataframe.from_delayed](https://docs.dask.org/en/latest/generated/dask.dataframe.from_delayed.html). This function creates a Dask DataFrame from a list of delayed objects as long as each delayed object returns a pandas dataframe. The structure of each individual dataframe returned must also be the same."
]
},
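A hedged sketch of the from_delayed pattern (the loader is a stand-in for reading one messy file):

```python
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load(i):
    # stand-in for reading and cleaning a single messy file
    return pd.DataFrame({'x': [i, i + 1]})

parts = [load(i) for i in range(3)]  # each returns a Pandas dataframe
ddf = dd.from_delayed(parts)         # all parts must share one structure
ddf.compute()
```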
{
2 changes: 1 addition & 1 deletion machine-learning/incremental.ipynb
@@ -27,7 +27,7 @@
"\n",
"Although this example uses Scikit-Learn's SGDClassifer, the `Incremental` meta-estimator will work for any class that implements `partial_fit` and the [scikit-learn base estimator API].\n",
"\n",
"<img src=\"http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png\"> <img src=\"https://www.continuum.io/sites/default/files/dask_stacked.png\" width=\"100px\">\n",
"<img src=\"http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png\"> <img src=\"http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg\" width=\"200px\">\n",
"\n",
"[scikit-learn base estimator API]:http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html\n",
"\n"
1 change: 0 additions & 1 deletion machine-learning/xgboost.ipynb
@@ -272,7 +272,6 @@
"metadata": {},
"source": [
"## Learn more\n",
"* Similar example that uses DataFrames for a real world dataset: http://ml.dask.org/examples/xgboost.html\n",
"* Recorded screencast stepping through the real world example above:\n",
"* A blogpost on dask-xgboost http://matthewrocklin.com/blog/work/2017/03/28/dask-xgboost\n",
"* XGBoost documentation: https://xgboost.readthedocs.io/en/latest/python/python_intro.html#\n",
2 changes: 1 addition & 1 deletion surveys/2019.ipynb
@@ -151,7 +151,7 @@
"source": [
"Overall, documentation is still the leader across user user groups.\n",
"\n",
"The usage of the [Dask tutorial](https://github.com/dask/dask-tutorial) and the [dask examples](examples.dask.org) are relatively consistent across groups. The primary difference between regular and new users is that regular users are more likely to engage on GitHub.\n",
"The usage of the [Dask tutorial](https://github.com/dask/dask-tutorial) and the [dask examples](https://examples.dask.org) are relatively consistent across groups. The primary difference between regular and new users is that regular users are more likely to engage on GitHub.\n",
"\n",
"From StackOverflow questions and GitHub issues, we have a vague idea about which parts of the library are used.\n",
"The survey shows that (for our respondents at least) DataFrame and Delayed are the most commonly used APIs."