
Commit d3b800b ("freezingdocs")

1 parent 5f84b68

File tree

4 files changed: +133 -42 lines

docs/make.jl  (+1)
@@ -55,6 +55,7 @@ makedocs(
=#
# Not really sure where this belongs... some in Fluxperimental, aim to delete?
"Custom Layers" => "models/advanced.md", # TODO move freezing to Training
+"Advanced tweaking of models" => "tutorials/misc-model-tweaking.md",
],
],
format = Documenter.HTML(

docs/src/models/advanced.md  (-41)
@@ -75,47 +75,6 @@ Flux.@layer Affine trainable=(W,)
There is a second, more severe, kind of restriction possible. This is not recommended, but is included here for completeness. Calling `Functors.@functor Affine (W,)` means that no exploration of the model will ever visit the other fields: they will not be moved to the GPU by [`gpu`](@ref), and their precision will not be changed by `f32`. This requires the `struct` to have a corresponding constructor that accepts only `W` as an argument.
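For concreteness, a minimal sketch of what such a restricted layer might look like (the zero-bias convenience constructor here is an assumption for illustration):

```julia
using Flux, Functors

struct Affine
    W
    b
end

(a::Affine)(x) = a.W * x .+ a.b

# Restricting the functor to `(W,)` means reconstruction only hands back `W`,
# so a constructor accepting `W` alone must exist:
Affine(W) = Affine(W, zeros(Float32, size(W, 1)))

Functors.@functor Affine (W,)  # gpu / f32 / fmap will now only ever visit W
```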

-
-## Freezing Layer Parameters
-
-When it is desired to not include all the model parameters (for e.g. transfer learning), we can simply not pass in those layers into our call to `params`.
-
-!!! compat "Flux ≤ 0.14"
-    The mechanism described here is for Flux's old "implicit" training style.
-    When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`.
-
-Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
-this using the slicing features `Chain` provides:
-
-```julia
-m = Chain(
-      Dense(784 => 64, relu),
-      Dense(64 => 64, relu),
-      Dense(32 => 10)
-    );
-
-ps = Flux.params(m[3:end])
-```
-
-The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.
-
-During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed.
-
-`Flux.params` also takes multiple inputs to make it easy to collect parameters from heterogenous models with a single call. A simple demonstration would be if we wanted to omit optimising the second `Dense` layer in the previous example. It would look something like this:
-
-```julia
-Flux.params(m[1], m[3:end])
-```
-
-Sometimes, a more fine-tuned control is needed.
-We can freeze a specific parameter of a specific layer which already entered a `Params` object `ps`,
-by simply deleting it from `ps`:
-
-```julia
-ps = Flux.params(m)
-delete!(ps, m[2].bias)
-```
-
## Custom multiple input or output layer

Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in machine learning literature is the [inception module](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf).
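For instance, a minimal sketch of such a multi-input layer (the `Join` name, the branch layers and the sizes are assumptions for illustration):

```julia
using Flux

# A layer holding two branches, one per input, whose outputs are concatenated.
struct Join{T1, T2}
  branch1::T1
  branch2::T2
end
Flux.@layer Join

(j::Join)(x, y) = vcat(j.branch1(x), j.branch2(y))

model = Join(Dense(3 => 4, relu), Dense(2 => 4, relu))
model(rand(Float32, 3), rand(Float32, 2))   # 8-element output
```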

docs/src/training/optimisers.md  (+1 -1)
@@ -76,7 +76,7 @@ Flux.Optimise.Optimiser

## Scheduling Optimisers

-In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/dev/README.html). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.
+In practice, it is fairly common to schedule the learning rate of an optimiser to obtain faster convergence. There are a variety of popular scheduling policies, and you can find implementations of them in [ParameterSchedulers.jl](http://fluxml.ai/ParameterSchedulers.jl/dev). The documentation for ParameterSchedulers.jl provides a more detailed overview of the different scheduling policies, and how to use them with Flux optimisers. Below, we provide a brief snippet illustrating a [cosine annealing](https://arxiv.org/pdf/1608.03983.pdf) schedule with a momentum optimiser.

First, we import ParameterSchedulers.jl and initialize a cosine annealing schedule to vary the learning rate between `1e-4` and `1e-2` every 10 steps. We also create a new [`Momentum`](@ref) optimiser.
docs/src/tutorials/misc-model-tweaking.md  (new file, +131)
@@ -0,0 +1,131 @@
# Choosing differentiable/gpu parts of the model

!!! note
    This tutorial covers a few loosely connected ways of customising your
    models even further. It assumes familiarity with [`Flux.@layer`](@ref),
    [`Flux.@functor`](@ref), [`freeze!`](@ref Flux.freeze!) and other Flux basics.

Flux provides several ways of freezing parameters: excluding them from
backpropagation entirely, or marking custom struct fields so that they are not
moved to the GPU ([Functors.@functor](@ref)) and hence never trained. The
following subsections should make it clear which one best suits your needs.

## On-the-fly freezing per model instance

Perhaps you'd like to freeze some of the weights of the model (even in the
middle of training), and Flux accomplishes this through [`freeze!`](@ref Flux.freeze!) and `thaw!`.

```julia
m = Chain(
      Dense(784 => 64, relu), # freeze this one
      Dense(64 => 64, relu),
      Dense(64 => 10)
    )
opt_state = Flux.setup(Momentum(), m);

# Freeze some layers right away
Flux.freeze!(opt_state.layers[1])

for data in train_set
  input, label = data

  # Some parameters can also be frozen mid-training:
  Flux.freeze!(opt_state.layers[2])

  grads = Flux.gradient(m) do m
    result = m(input)
    loss(result, label)
  end
  Flux.update!(opt_state, m, grads[1])

  # Optionally unfreeze the params later
  Flux.thaw!(opt_state.layers[1])
end
```
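Both calls also act on whole sub-trees of the optimiser state, so everything can be frozen and thawed at once. A minimal sketch, reusing `m` and `opt_state` from above:

```julia
Flux.freeze!(opt_state)            # freeze every parameter of the model
Flux.thaw!(opt_state)              # thaw them all again
Flux.freeze!(opt_state.layers[3])  # or freeze just one layer's state
```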

## Static freezing per model definition

Sometimes some parts of the model ([`Flux.@layer`](@ref)) needn't be trained at all, but their parameters
still need to reside on the GPU, because they are used in the forward
and/or backward pass.
```julia
struct MaskedLayer{T}
  chain::Chain
  mask::T
end
Flux.@layer MaskedLayer trainable=(chain,)
# the mask field will not be updated in the training loop

function (m::MaskedLayer)(x)
  # the mask field will still be moved to the GPU for efficient operations:
  return m.chain(x) + x + m.mask
end

model = MaskedLayer(...) # this model will not have the `mask` field trained
```
Note that this method permanently excludes such fields from training; there is
no on-the-fly switching.
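A quick way to check this (a sketch, assuming `model` above was built with a concrete chain and mask): only the `chain` field is reported as trainable, so `Flux.setup` creates no optimiser state for `mask`.

```julia
Flux.trainable(model)                  # returns (chain = Chain(...),) so only `chain` is listed
opt_state = Flux.setup(Adam(), model)  # `mask` gets no optimiser state and is never updated
```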

## Excluding from model definition

Sometimes some parameters aren't just "not trainable" but they shouldn't even
be transferred to the GPU (or be part of the functor at all). All scalar fields behave
like this by default, so things like learning-rate multipliers are neither trainable
nor transferred to the GPU by default.
```julia
struct CustomLayer{T, F}
  chain::T
  activation_results::Vector{F}
  lr_multiplier::Float32
end
Flux.@functor CustomLayer (chain,) # explicitly leaving out `activation_results`

function (m::CustomLayer)(x)
  result = m.chain(x) + x

  # `activation_results` is not part of the GPU loop, so we can do
  # things like `push!`:
  push!(m.activation_results, mean(result))
  return result
end
```
See more about this in [`Flux.@functor`](@ref).
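A rough usage sketch (the sizes, the inner `Dense` chain, and the use of `Statistics.mean` are assumptions for illustration):

```julia
using Flux, Statistics

layer = CustomLayer(Chain(Dense(4 => 4)), Float32[], 0.1f0)

x = rand(Float32, 4, 8)
y = layer(x)                        # the forward pass records mean(y) as a side effect
length(layer.activation_results)    # == 1

# gpu / f32 traversal sees only `chain`; `activation_results` and
# `lr_multiplier` are never visited (see [`Flux.@functor`](@ref) above).
```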

## Freezing Layer Parameters (deprecated)

When it is desired to not include all the model parameters (e.g. for transfer learning), we can simply not pass in those layers into our call to `params`.

!!! compat "Flux ≤ 0.14"
    The mechanism described here is for Flux's old "implicit" training style.
    When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`.

Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
this using the slicing features `Chain` provides:

```julia
m = Chain(
      Dense(784 => 64, relu),
      Dense(64 => 64, relu),
      Dense(64 => 10)
    );

ps = Flux.params(m[3:end])
```

The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.

During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that one would have its parameters changed.

`Flux.params` also takes multiple inputs to make it easy to collect parameters from heterogeneous models with a single call. A simple demonstration would be if we wanted to omit optimising the second `Dense` layer in the previous example. It would look something like this:

```julia
Flux.params(m[1], m[3:end])
```

Sometimes, more fine-grained control is needed.
We can freeze a specific parameter of a specific layer which has already entered a `Params` object `ps`
by simply deleting it from `ps`:

```julia
ps = Flux.params(m)
delete!(ps, m[2].bias)
```
