-
Notifications
You must be signed in to change notification settings - Fork 21
/
index.idyll
319 lines (196 loc) · 29.4 KB
/
index.idyll
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
[meta title:"Unraveling The JPEG" description:"JPEG images are everywhere in our digital lives, but behind the veil of familiarity lie algorithms that remove details that are imperceptible to the human eye. This produces the highest visual quality with the smallest file size—but what does that look like? Let's see what our eyes can't see!"
shareImageUrl:"https://parametric.press/issue-01/unraveling-the-jpeg/static/images/share.png"
shareImageWidth:"880"
shareImageHeight:"440" /]
[var name:"parametricSlug" value:"unraveling-the-jpeg" /]
[Nav fullWidth:true /]
[Header
title:`["Unraveling", "the JPEG"]`
longTitle:`["Unraveling", "the JPEG"]`
date:"May 1, 2019"
dek:"JPEG images are everywhere in our digital lives, but behind the veil of familiarity lie algorithms that remove details that are imperceptible to the human eye. This produces the highest visual quality with the smallest file size—but what does that look like? Let's see what our eyes can't see!"
fullWidth:true
headerImage:"static/images/luxury.png"
authors:`[{
name: "Omar Shehata",
role: 'Author & Developer',
url: "https://omarshehata.me/"
}]`
doi:"https://doi.org/10.5281/zenodo.2655041"
archive:`'https://parametric-press-archives.s3.amazonaws.com/issue-01/' + parametricSlug + '.warc.gz'`
source:`"https://github.com/ParametricPress/01-" + parametricSlug `
/]
It's easy to take for granted that you can send a picture to a friend without worrying about what device, browser, or operating system they’re using, but things weren’t always this way. By the early 1980s, computers could store and display digital images, but there were many competing ideas about how best to do that. You couldn't just send an image from one computer to another and expect it to work.
To solve this problem, the Joint Photographic Experts Group (JPEG), a committee of experts from all over the world, was established in 1986 as a joint effort by the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission)—two international standards organizations headquartered in Geneva, Switzerland.
JPEG, the group of people, created JPEG, a standard for digital image compression, in 1992. Anyone who's ever used the internet has probably seen a JPEG-encoded image. It is by far the most ubiquitous way of encoding, sending and storing images. From web pages to email to social media, JPEG is used billions of times a day—almost every time we view or send images online. Without JPEG, the web would be a little less colorful, a lot slower, and probably have far fewer cat pictures!
This article is about how to decode a JPEG image. In other words, it's about what it takes to convert the compressed data stored on your computer to the image that appears on the screen. It's worth learning about not just because it's important to understand the technology we all use everyday, but also because, as we unravel the layers of compression, we learn a bit about perception and vision, and about what details our eyes are most sensitive to.
It's also just a lot of fun to play with images this way.
[br/]
[FullWidth]
[img style:`{width: '100%'}` src:"static/images/glitchy-cat.gif"/]
[/FullWidth]
[br/]
## Peering Inside a JPEG
Everything on a computer is stored as a series of binary numbers. Typically, these bits, the zeros and ones, are arranged in groups of eight, known as bytes. When you open a JPEG image on your computer, something (the browser, your operating system, or something else) has to decode the bytes to recover the original image as a list of colors that can then be displayed.
[div]
If you download [Reference content:"The adorable kitten image used in this article was taken by <a href='https://unsplash.com/photos/NodtnCsLdTE'>Mikhail Vasilyev</a>."] that picture of the cat [/Reference] and open it using any text editor, you'll see a bunch of garbled characters.
[/div]
[br/]
[img src:"static/images/blanket-cat-notepad.png"/]
[Caption]
Here I'm using [a href:"https://notepad-plus-plus.org/"]Notepad++[/a] to look at the image file, since common text editors like Windows' Notepad will change the file's binary contents when you save it so it's no longer a valid JPEG.
[/Caption]
[br/]
By opening an image in a text editor, you’ve confused the computer, in the same way you confuse your brain when you rub your eyes too hard and start to see blotches of dimness and color!
These blotches you see—known as [a href:"https://en.wikipedia.org/wiki/Phosphene" target:"_blank"]phosphenes[/a]—don't come from any light stimulus, nor are they hallucinations made up in your mind. They arise because your brain assumes that any electrical signal arriving through the nerves in your eye is conveying light information. The brain needs to make this assumption because there’s no way to know whether a given signal is sound, sight, or something else. All the nerves in your body carry exactly the same type of electrical pulse. When you apply pressure by rubbing your eyes, you’re sending non-visual signals that trigger the receptors in your eye, which your brain interprets—incorrectly, in this case—as vision. You can literally see the pressure!
It’s fun to think about how similar computers are to our brains, but this is also a useful analogy because it illustrates how much the meaning of data—whether carried through a body by nerves or stored in a computer—relies on how you interpret it. All binary data is made up of ones and zeros, basic components that could be conveying any kind of information. Your computer often guesses how to interpret it using clues, like the file extension. Here we’ve forced it to interpret it as text, because that’s what a text editor expects.
To understand how a JPEG image is decoded we need to see the original signals themselves—the binary data. This can be done with a [Tooltip content:"You don't necessarily <i>need</i> a hex editor. You can edit the characters as text, save the file, and re-open it as an image just fine. But since our goal is to understand how JPEG stores images, it's helpful to be able to actually read the bytes!"]hex editor[/Tooltip], or it can be done right here in this webpage! Below is the image next to [Tooltip content:"The only thing omitted here is the header, which is a section of bytes at the beginning of any JPEG image that contain metadata like the width and height. We'll revist this later in the article."]all of its bytes[/Tooltip], represented as decimal numbers. You can make changes to the bytes, and it will re-decode and display the new, edited image as you type.
[RawEditor ref:"Raw1" fullWidth:true imageUrl:"static/images/blanket-cat.jpg" showHeader:false maxWidth:700 /]
[aside]
_Hint: try scrolling down and removing a few chunks. Don't worry, you can always reset the image back to the original!_[br/]
[/aside]
There's a lot you can learn just from playing around with this editor. For example, can you figure out the order the pixels are stored in?
Something strange in the example above is that changing some numbers doesn't seem to impact the image at all, while [EditorLink editor:`refs.Raw1` line:1 search:' 17 ' replace:' 0 ']setting the 17 on line one to 0[/EditorLink] completely ruins the image! Other actions, like [EditorLink editor:`refs.Raw1` line:1988 search:' 7 ' replace:' 254 ']setting the 7 on line 1988 to 254[/EditorLink] change the color, but only for subsequent pixels.
[Aside]
[Recirc slug:parametricSlug /]
[/Aside]
Perhaps what's most peculiar is that some numbers change not just the color but the shape of the image as well. Change the [EditorLink editor:`refs.Raw1` line:12 search:' 70 ' replace:' 2 ']70 in line 12 to 2[/EditorLink] and look at the top row of the image to see what I mean. No matter what JPEG image you're using, you'll always find these mysterious checkerboard patterns when editing the bytes.
It's hard to decipher how the image can be reconstructed from these bytes by playing around like this, because JPEG compression is actually composed of three different compression techniques, which are applied in successive layers. We'll look at each of these layers of compression separately to unravel this mysterious behavior we're seeing.
[inset]
### The three layers of JPEG compression
1. Chrominance Subsampling
2. Discrete Cosine Transform & Quantization
3. Run-Length, Delta & Huffman Encoding
[/inset]
To give you an idea of the scale of this compression, notice that the image above is represented using exactly 79,819 numbers, which makes it about 79 kilobytes. If it were stored with no compression, three numbers would be needed for each pixel—one for each of the red, green and blue components. That would mean a total of 917,700 numbers, or about 917 kilobytes. With JPEG compression, the resulting file is over **ten times smaller**!
In fact, this image can be squeezed into far fewer bytes. Below is the image next to a version of it that was compressed down to just 16 kilobytes, which makes it **fifty-seven times smaller** than the uncompressed version would be!
[img src:"static/images/compression-compare.png"/]
[br/]
If you look closely you'll see that these images are not identical. Both are JPEG-encoded images, but the one on the right is much smaller in terms of file size. It also doesn't look as nice (notice how blocky the colors look in the background). This is why JPEG is known as a **lossy** compression technique; the image changes and loses some detail as a result of the compression.
## 1. Chrominance Subsampling
Here's the image with just the first layer of compression applied.
[ChromaEditor fullWidth:true ref:"Chroma1" imageUrl:"static/images/blanket-cat-smallest.jpg" maxWidth:400 /]
[aside]
_Hint: Notice that removing one number ruins all the colors. But removing exactly 6, or any multiple of 6, has a minimal effect on the image._[br/]
[/aside]
It's a little more straightforward to decipher now. It's almost a simple list of colors, where each byte changes exactly one pixel, and yet it's already almost twice as small as the uncompressed image (which would be around 300 kb for this smaller size). Can you guess why?
You can tell that these numbers don't represent the standard red, green and blue components because [EditorLink editor:`refs.Chroma1` pattern:'zero-all']replacing all the numbers with 0[/EditorLink] turns the image green (as opposed to black).
This is because these bytes represent the [EditorLink editor:`refs.Chroma1` pattern:'isolate-Y' scale:4]Y[/EditorLink] (brightness), [EditorLink editor:`refs.Chroma1` pattern:'isolate-Cb' scale:4]Cb[/EditorLink] (relative blueness), and [EditorLink editor:`refs.Chroma1` pattern:'isolate-Cr' scale:4]Cr[/EditorLink] (relative redness) of the image.
Why not just use RGB? After all, that's how most modern screens work. Your monitor can display any color by turning on red, green and blue lights at various intensities for each pixel. White is displayed by turning on all three colors at full brightness, while black is displayed by turning them all off.
[br/]
[img src:"static/images/pixels.png"/]
// [caption]Adapted from http://www.chem.purdue.edu/jmol/cchem/RGBColors/body_rgbcolors.html. [/caption]
[br/]
That's also very similar to how human eyes work. The color receptors in our eyes known as “cones” are split into three types, each of which is mostly sensitive either to red, green, or blue. Rods, the other type of receptor we have in our eyes, can only detect changes in brightness, but they’re far more sensitive. We have about one hundred and twenty million rods in our eyes, compared to a measly six million cones.
This means that our eyes are much better at detecting changes in brightness than they are at detecting changes in color. If we can separate the color from the brightness, we can remove a bit of the color without anyone noticing. **Chrominance subsampling** is the process of representing an image's color components at a lower resolution than its luminance components. In the example above, each pixel has exactly one Y component, while each discrete group of four pixels has exactly one Cb and one Cr component. So the image contains only a quarter as much color information as it originally did.
[aside]
Using the YCbCr colorspace is not unique to JPEG. In fact, it was originally [a href:"https://en.wikipedia.org/wiki/GeorgesValensi"]developed in 1938 for TV broadcasts[/a]. Not everyone had color TVs, so separating out the color from the luminance allowed everyone to receive the same transmission, and TVs that didn't support color would just use the luminance component.
[/aside]
This is why removing one number from the editor above completely ruins the color. Here, the [Tooltip content:"In general, JPEG images need not store the components in any particular order. They might be interleaved like in this example, or all the Y components of the image could be stored followed by all the Cb etc. The header of the JPEG is what tells us how the components are stored."]components are stored[/Tooltip] as **Y Y Y Y Cb Cr**. Removing the first number causes the Cb value to be interpreted as Y, the Cr as Cb, and creates a ripple effect that flips all the colors across the image.
There's nothing in the JPEG specification that says you must use YCbCr. Most JPEG images you'll find use it because it tends to produce higher quality images after subsampling compared to RGB. You don't have to take this for granted though. You can see for yourself in the grid below what it looks like to subsample each component individually across RGB as well as YCbCr.
[var name:"subsamplePercent" value:0.15 /]
[SubsampleGrid imageUrl:"static/images/blanket-cat-tiny.jpg" subsamplePercent:`subsamplePercent ` maxWidth:200 /]
[div className:"subsample-controls" ]
Subsample percent: [Display value:subsamplePercent format:".0%" /]
[Range value:subsamplePercent min:0 max:1 step:0.05 /]
[br/]
Move slider to adjust the amount of subsampling applied.
[/div]
[aside]
Removing a bit of blue isn't as noticeable as removing red or green. This is because of the six million cones you have in your eyes, about 64% are most sensitive to red, 32% to green, and only 2% to blue.
[/aside]
Subsampling the Y component (bottom left) has the greatest effect on the image quality. Even a tiny bit is already noticeable. You can move the slider to see how removing a greater percentage of each component affects the image.
Converting an image from RGB to YCbCr doesn't make the file size any smaller, but it does make it easier to find less noticeable details to remove. It's that second step where the actual lossy compression happens. This idea of finding new ways to represent data to make it more compressible is at the heart of what the next layer does.
## 2. Discrete Cosine Transform & Quantization
This layer of compression is largely the defining feature of JPEG. After the colors are converted to YCbCr, the components are compressed individually, so we can focus on just the Y component for the rest of the article. Here's what the bytes for the Y component look like with this layer applied.
[DctEditor fullWidth:true ref:"DCT1" comp:"Y" imageUrl:"static/images/blanket-cat-smallest.jpg" maxWidth:400 /]
[aside]
Hint: Try clicking on any pixel in the image to see the line in the editor that represents it. Try removing numbers from the end, or adding a few zeros to any individual number to make its effect more obvious.
[/aside]
At first glance, this seems like very poor compression. There are 100,000 pixels in this image, and yet it takes 102,400 numbers to represent the luminance of each pixel—that's worse than not compressing it at all!
But notice that most of these values are 0. In fact, [EditorLink editor:`refs.DCT1` pattern:"remove-zeros"]all of these trailing zeros can be removed[/EditorLink] with no change to the image. This leaves only about 26,000 numbers, which makes it about four times smaller!
In this layer lies the secret to the checkerboard patterns. Unlike the other effects we've seen, the appearance of these patterns is not a glitch. They are in fact the building blocks of the entire image. Every line in the editor above contains exactly 64 numbers, known as the Discrete Cosine Transform (DCT) coefficients, which correspond to intensities of 64 unique patterns.
These patterns are formed out of cosine waves. Here's what a few of them look like:
[img src:"static/images/dct_coefficients.png"/]
[caption]
These are 8 of the 64 discrete cosine transform coefficients. Credit: [a href:"http://www.jezzamon.com/fourier/index.html"]Jez Swanson[/a].
[/caption]
Below is an image that shows all 64 of them individually.
[data name:"DCT-grid" source:"DCT-grid.json" /]
[data name:"DCT-circle" source:"DCT-circle.json" /]
[data name:"DCT-eye" source:"DCT-eye.json" /]
[DctEditor fullWidth:true ref:"DCT2" imageUrl:"static/images/DCTGrid.jpg" override:DCT-grid isUrlExempt:true/]
These patterns are special because they form a basis for 8x8 images. If you're not familiar with linear algebra, what that means is that any 8x8 image, anything at all that you can imagine, can be made out of these specific 64 patterns. The Discrete Cosine Transform is the process of breaking up the image into 8x8 blocks and converting each block into a combination of these 64 coefficients. Here's [EditorLink editor:`refs.DCT2` pattern:"animate-DCT" DCT:DCT-circle]how you would form a circle[/EditorLink] by combining these patterns, or [EditorLink editor:`refs.DCT2` pattern:"animate-DCT" DCT:DCT-eye]the cat's face[/EditorLink]. You can [EditorLink editor:`refs.DCT2` pattern:"animate-DCT" DCT:DCT-grid]click here to go back to the grid of 64 patterns[/EditorLink].
It seems like magic to say that any image can be represented using 64 specific patterns. But this is the same thing as saying any location on the Earth can be represented using only two numbers: longitude and latitude. We often treat the surface of the Earth as two-dimensional, so only two numbers are needed. An 8x8 image is sixty-four-dimensional, so we need sixty-four numbers.
In terms of compression, it’s not obvious how this helps us. If we need sixty-four numbers to represent an 8x8 image, why is this better than storing the sixty-four luminance components? We do it for the same reason we converted from the three numbers of RGB to the three numbers of YCbCr: it allows us to remove detail that’s less noticeable.
It's hard to see exactly what the details that are removed in this compression step look like because JPEG only applies the Discrete Cosine Transform to blocks of 8x8 pixels at a time. However, there’s no reason we can’t apply it to the whole image. Here’s what it looks like to apply the DCT to the Y component of the entire image:
[FullDctEditor ref:"DCT3" fullWidth:true imageUrl:"static/images/blanket-cat-smallest.jpg" maxWidth:400 /]
We can [EditorLink editor:`refs.DCT3` pattern:"remove-{170}"]remove over 60,000 numbers[/EditorLink] from the end with almost no noticeable change. But notice that if [EditorLink editor:`refs.DCT3` pattern:"zero-{5}"]we set just the first five numbers to zero[/EditorLink] (ignoring the first because it just makes the image darker) there's already an obvious difference.
[Aside]
[Newsletter /]
[/Aside]
The numbers at the beginning represent the lower frequency changes in the image, which our eyes are better at detecting. The numbers towards the end represent the higher frequency changes, which are harder for us to see, so we don't notice when they're gone. To see "what our eyes can't see", we can isolate these high frequency details by setting the [EditorLink editor:`refs.DCT3` pattern:"zero-{5000}"]first 5,000 numbers to zero[/EditorLink].
What you're looking at here is all the areas of the image that have the greatest change from one pixel to the next. The cat's eyes, whiskers, fuzzy blanket and shadows in the bottom left corner all stand out. This can be taken even further, to setting the first [EditorLink editor:`refs.DCT3` pattern:"zero-{10000}"]10,000 numbers to zero[/EditorLink]; [EditorLink editor:`refs.DCT3` pattern:"zero-{20000}"]20,000[/EditorLink]; [EditorLink editor:`refs.DCT3` pattern:"zero-{40000}"]40,000[/EditorLink] or [EditorLink editor:`refs.DCT3` pattern:"zero-{60000}"]60,000[/EditorLink].
These high frequency details are what JPEG removes during this compression step. Converting the colors to the DCT coefficients is not a lossy operation. It's the quantization step that's lossy, where values that are high frequency, close to zero, or both, are removed. When you select a lower quality setting when creating a JPEG image, it increases the threshold for how many of these values are removed, which leads to a smaller file size but a blockier image. This is why the version of the image in the first section that was 57 times smaller looked blocky. Each 8x8 block was represented by far fewer DCT coefficients compared to the higher quality version.
[aside]
One really cool thing you can do with this technique is progressively stream pictures. Imagine seeing a blurry version of the whole image and slowly seeing it become more and more detailed as the download progresses and more DCT coefficients are available. This is actually possible to do with JPEG, but not as commonly used.
[/aside]
Just for fun, here's what it looks like using just [EditorLink editor:`refs.DCT3` pattern:"remove-{268}"]24,000 numbers[/EditorLink], or just [EditorLink editor:`refs.DCT3` pattern:"remove-{316}"]5000 numbers[/EditorLink]. Pretty blurry, but almost recognizable!
## 3. Run-Length, Delta & Huffman Encoding
All the compression steps so far have been lossy. This last layer, by contrast, is lossless. It doesn't remove any information, but it does make the file size significantly smaller.
How do you compress something without throwing away any information? Think about how you would represent a simple solid black image.
[RawEditor ref:"Huffman1" fullWidth:true imageUrl:"static/images/black.jpg" showHeader:false isUrlExempt:true /]
JPEG uses about 5,000 numbers to represent this, but we can do much better. Can you think of an encoding scheme to represent this image using as few bytes as possible?
The smallest I could think of was four bytes: three to specify the color and one to specify how many pixels have this color. The idea of expressing all repeated values concisely this way is called **run-length encoding**. It's lossless because we can recover the encoded data exactly as it was before.
The file size of the solid black JPEG image is much bigger than four bytes because remember that in the DCT layer, the compression is applied to 8x8 blocks at a time. So at minimum we'll need one DCT coefficient for each 64 pixel block. We only need one because instead of storing one DCT coefficient followed by 63 zeros for this image, run-length encoding allows us to just store one number and say "the rest are zero".
**Delta-encoding** is the technique of storing each byte as a relative value compared to something before it instead of storing its absolute value. This is the reason editing certain bytes will change the color for all subsequent pixels. For example, instead of storing:
```
12 13 14 14 14 13 13 14
```
You would start with 12, and from there, just store how much you need to add or subtract to get the next number. So once delta-encoded the sequence above becomes:
```
12 1 1 0 0 -1 0 1
```
Once again, the transformed data is not any smaller than the original, but it is more compressible. Applying delta encoding before run-length can help a lot, while still remaining a completely lossless compression step.
Delta encoding is one of the few techniques that is applied outside the 8x8 blocks. Out of the 64 DCT coefficients, the first one is just a constant wave function (you see it as a solid color). It represents the average brightness of each block for the luminance components, or the average blueness for the Cb components etc. This first value in each DCT block is called the DC value, and each DC value is delta-encoded relative to the ones before it. So changing the brightness of the very first block will affect all blocks in the image.
This all leaves just one final mystery: how can changing just a single number completely wreck the image? This was not a property of any of compression layers so far. The answer lies in the JPEG header. It's the first 500 or so bytes that contain metadata about the image, like its width and height, and has been omitted from all the byte editors so far.
Below is the original image with the header included.
[RawEditor ref:"Huffman2" fullWidth:true imageUrl:"static/images/blanket-cat.jpg" maxWidth:700 /]
Without the header, it's practically impossible (or at least very difficult) to decode the JPEG image. It would be as if I was trying to describe a painting to you, and I started to invent words to communicate what I saw. It's probably going to be a very concise description, since I can define the words to mean exactly what I want to communicate, but it would be meaningless to anyone other than me.
This may sound ridiculous, but this is exactly what's going on here. Every single JPEG image is compressed with a code that's specific to this particular image. These codes are defined in a dictionary stored in the header. This technique is called **Huffman encoding**, and the dictionary is called a Huffman table. This table is marked in the header by two bytes: 255 followed by 196. Each color component may have its own Huffman table.
Changes to these Huffman tables will have the most dramatic effects on any image. Changing [EditorLink editor:`refs.Huffman2` line:15 search:'125 1' replace:'125 12']the second 1 to 12 on line 15[/EditorLink] is a good example. Changing anything after the 125 on that line works too.
The Huffman tables have such a dramatic effect on the image because they tell us how to read the individual bits. So far we've just been dealing with the binary numbers in decimal. This hides the fact that if you want to store the number [i]1[/i] in a byte, it would look like _00000001_, because each byte must have exactly eight bits even if it only needs one bit.
This is potentially a huge waste of storage if you have a lot of small numbers. Huffman encoding is a technique that allows us to relax this requirement that each number must occupy eight bits. That means if you see the two bytes:
```
234 115
```
Based on the Huffman table, these could actually be three values. To extract them, you'll need to first break them into their individual bits:
```
11101010 01110011
```
[aside]
One neat trick you can do with this knowledge is strip out the header from a JPEG image and save it separately. You're effectively making it so only you can read it. [a href:"https://code.fb.com/android/the-technology-behind-preview-photos/" target:"_blank"]Facebook actually does this[/a] to make JPEG images even smaller.
Another thing you can do is change the Huffman table just slightly. To anyone else, it looks like a corrupted image. But only you would know the magic edit needed to fix it.
[/aside]
Then follow the table to figure out how to group them. For example, it could be the first six bits (111010) which is 58 in decimal, followed by another five bits (10011) which is 19 and finally the last four bits (0011), which is three.
[div]
This is why it's very difficult to make sense of the bytes at this layer of compression. The bytes don't actually represent what they seem to represent. I won't go into the details of how to extract the Huffman table and translate the bits in this article, but there are [Reference content:"Tom Scott's <a href='https://www.youtube.com/watch?v=JsTptu56GM8'>video on Huffman Encoding</a> is a great explanation of how it works in general. This <a href='https://www.impulseadventure.com/photo/jpeg-huffman-coding.html'>article by Calvin Hass at ImpulseAdventure</a> goes into how to extract the Huffman tables from a JPEG image and use it for decoding."]many good resources on this if you're curious[/Reference].
[/div]
[hr/]
So to summarize, what all does it take to decode a JPEG image? You need to:
1. Extract the Huffman table(s) from the header and decode the bits.
2. Extract the Discrete Cosine Transform coefficients for each color/luminance component, for each 8x8 block, by undoing the run-length and delta encodings.
3. Combine the cosine waves based on the coefficients to get back the pixel values for each 8x8 block (this is known as the inverse Discrete Cosine Transform).
4. Scale up the chrominance components if they were subsampled (the header has this information).
5. Convert the resulting YCbCr of each pixel to RGB.
6. Display the image!
That's a lot of work to view a simple cat picture! But what I love about this is that you can see how JPEG is a very human-centric technology. It relies on the quirks of our perception to achieve compression rates far greater than is possible with general purpose techniques. And now that you understand how JPEG works, you can imagine how many of these techniques can be extended to other domains. For example, applying delta-encoding in video can produce a huge file size reduction since there are often areas that don't change at all between frames (such as the background).
All the code for this article is [a href:"https://github.com/ParametricPress/01-unraveling-the-jpeg"]open source[/a] and includes instructions on replacing the images in these byte editors with your own.
[AuthorBio]
[b][a href:"https://omarshehata.me/"]Omar Shehata[/a][/b] is a graphics programmer at Cesium working on open source, web-based 3D maps. He grew up in Alexandria, Egypt and currently lives in Philadelphia, PA.
Edited by Matthew Conlen and Victoria Uren.
[/AuthorBio]
[NextArticle slug:parametricSlug fullWidth:true /]
[Footer fullWidth:true /]
[Initialize /]
[Analytics google:"UA-139053456-1" tag:parametricSlug /]