forked from neuroradiology/InsideReCaptcha
Marin committed Dec 9, 2014 · 0 parents · commit bfecbbe
Showing 5 changed files with 1,050 additions and 0 deletions.
@@ -0,0 +1,61 @@
# Summary

A few days ago, Google introduced a [new version of ReCaptcha](http://googleonlinesecurity.blogspot.fr/2014/12/are-you-robot-introducing-no-captcha.html), theoretically allowing most users to complete it by only ticking a checkbox. If Google doesn't deem the user human, the old version with distorted text appears. Although I used a normal Firefox version, I still had to fill in the text captcha after clicking, so it didn't really work for me. Curiosity led me to look at the JavaScript to find out how all this really works...

# What happens on the wire

First, the browser makes the following requests:

* `https://www.google.com/recaptcha/api.js`, whose function is mainly to load the next one...
* `https://www.gstatic.com/recaptcha/api2/r20141202135649/recaptcha__en.js`, which contains common code.
* `https://apis.google.com/_/scs/apps-static/_/js/` (followed by a bunch of more or less cryptic parameters), which contains other common JavaScript code.

The browser then makes a request to `https://www.google.com/recaptcha/api2/anchor`, whose response contains the really interesting stuff: a call to a function named `recaptcha.anchor.Main.init`, which is passed two base64-encoded parameters.

The first parameter points to a JavaScript file: [`https://www.google.com/js/bg/6yg-ggdQgQAg8SAADJkAjc-JMNnOnYuIGgH_iBV7uf8.js`](https://www.google.com/js/bg/6yg-ggdQgQAg8SAADJkAjc-JMNnOnYuIGgH_iBV7uf8.js). The second one contains *double-*base64-encoded binary data.
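
To make the "double base64" remark concrete, here is a minimal Python sketch of peeling both layers off the second parameter. The function names are hypothetical, and since the write-up does not say whether the standard or the URL-safe alphabet is used, or whether `=` padding is kept, the helper simply accepts all of these; treat that as an assumption rather than a description of what Google actually sends.

```python
import base64


def b64decode_lenient(data: bytes) -> bytes:
    """Decode base64 in either alphabet, restoring stripped '=' padding.

    Assumption: the alphabet and padding of the real parameter are unknown,
    so both variants are normalised to the URL-safe form before decoding.
    """
    data = data.strip().replace(b"+", b"-").replace(b"/", b"_")
    data += b"=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(data)


def decode_double_base64(param: str) -> bytes:
    """Undo two nested base64 layers and return the raw binary payload."""
    inner = b64decode_lenient(param.encode("ascii"))  # first layer: still base64 text
    return b64decode_lenient(inner)                   # second layer: binary data
```

The result would be the raw (still encrypted) bytecode blob, ready to be handed to a decryption step.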

It turns out this new ReCaptcha system is heavily obfuscated: **Google implemented a whole VM in JavaScript with a dedicated bytecode language**.

The JavaScript file referenced by the first parameter is the bytecode interpreter. After trimming the leading `(function(){eval('` and the trailing `')})()`, and passing the result to [JSBeautifier](http://jsbeautifier.org/), I finally dove into this mass of minified code.
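
The trimming step can be sketched in a few lines of Python. The helper name is hypothetical, and the sketch only removes the wrapper: any escape sequences inside the JavaScript string literal are left untouched, which is an assumption about what "trimming" means here.

```python
PREFIX = "(function(){eval('"
SUFFIX = "')})()"


def unwrap_eval(source: str) -> str:
    """Strip the eval wrapper and return the inner JavaScript text.

    Only the wrapper is removed; escapes inside the string literal
    (for example \\' ) are NOT decoded by this sketch.
    """
    source = source.strip()
    if source.endswith(";"):
        source = source[:-1]  # tolerate a trailing semicolon
    if not (source.startswith(PREFIX) and source.endswith(SUFFIX)):
        raise ValueError("input does not look like the expected eval wrapper")
    return source[len(PREFIX):len(source) - len(SUFFIX)]
```

The returned text can then be pasted into a beautifier.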

# The analysis

The interpreter has two entry points: the `M` function, which is executed when ReCaptcha is loaded, and `M.prototype.ha`, which is executed when you click the checkbox and returns the information destined for Google's servers.

I first discovered that the bytecode was encrypted using the [XTEA](https://en.wikipedia.org/wiki/XTEA) algorithm. Each block of 8 bytes is xored with a keystream block (so the decryption and encryption functions are the same). The keystream block is produced by running XTEA over a two-word input: the first 32-bit word is read from the bytecode file, the second 32-bit word is the position in the bytecode file divided by 8, and the key is *by default* `[0, 0, 0, 0]`.
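
The scheme described above amounts to XTEA used as a keystream generator. Below is a minimal Python sketch of that idea, not the code from `decomp.py`. Several details are assumptions because the text does not state them: the standard 32 XTEA rounds, big-endian serialisation of the two output words, and the name `nonce` for the word read from the bytecode file.

```python
import struct

MASK = 0xFFFFFFFF
DELTA = 0x9E3779B9


def xtea_encrypt_block(v0: int, v1: int, key, rounds: int = 32):
    """Standard XTEA encryption of one 64-bit block given as two 32-bit words."""
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + key[total & 3]))) & MASK
        total = (total + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + key[(total >> 11) & 3]))) & MASK
    return v0, v1


def xtea_keystream_xor(data: bytes, nonce: int, key=(0, 0, 0, 0)) -> bytes:
    """XOR data with an XTEA keystream; the same call encrypts and decrypts.

    Each 8-byte block at byte offset `pos` is xored with XTEA(nonce, pos // 8).
    Assumption: keystream words are serialised big-endian.
    """
    out = bytearray()
    for pos in range(0, len(data), 8):
        k0, k1 = xtea_encrypt_block(nonce & MASK, pos // 8, key)
        stream = struct.pack(">II", k0, k1)
        out.extend(b ^ s for b, s in zip(data[pos:pos + 8], stream))
    return bytes(out)
```

Because the operation is a plain XOR, applying `xtea_keystream_xor` twice with the same `nonce` and `key` returns the original bytes, which is why a single function serves both directions.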

By default... because that would have been too simple: it turns out the bytecode has direct access to the JavaScript variables of its *own* interpreter, and changes its *own* decryption key and even its *own* opcode numbers at many points.
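
To see why self-modifying opcode numbers get in the way of static analysis, consider this toy stack machine in Python. It is purely illustrative: the opcodes, the `ToyVM` class, and the `REMAP` instruction are invented for the example and have nothing to do with the real ReCaptcha instruction set.

```python
class ToyVM:
    """Tiny stack VM whose bytecode can renumber its own opcodes."""

    def __init__(self, code):
        self.code = list(code)
        self.pc = 0
        self.stack = []
        self.running = True
        # Initial opcode numbering; the REMAP instruction mutates this table.
        self.opcodes = {0: self.op_push, 1: self.op_add,
                        2: self.op_remap, 3: self.op_halt}

    def fetch(self):
        value = self.code[self.pc]
        self.pc += 1
        return value

    def op_push(self):
        self.stack.append(self.fetch())

    def op_add(self):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)

    def op_remap(self):
        # Swap the handlers of two opcode numbers taken from the bytecode.
        a, b = self.fetch(), self.fetch()
        self.opcodes[a], self.opcodes[b] = self.opcodes[b], self.opcodes[a]

    def op_halt(self):
        self.running = False

    def run(self):
        while self.running:
            self.opcodes[self.fetch()]()
        return self.stack


# PUSH 2, PUSH 3, REMAP 0<->1, then ADD is encoded as 0 and PUSH as 1.
program = [0, 2, 0, 3, 2, 0, 1, 0, 1, 10, 0, 3]
result = ToyVM(program).run()
```

A disassembler with a fixed opcode table would misread everything after the `REMAP`, so it has to track the interpreter state as it goes, much like the real interpreter does.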

Even niftier, at various points the bytecode key is derived by directly hashing JavaScript code from the interpreter (`Function.toString()` rocks, doesn't it?), or from the output of browser-specific functions and CSS rules, or from the hostname of the calling domain (www.google.com)...

**After about 2 days of work, I produced a working disassembler and then a decompiler for the ReCaptcha bytecode.** You can try it from this GitHub repository. However, it still has some hardcoded key values, so for now it only works on the bytecode sample contained in the `enc` file.

Just run `./decomp.py` to give it a try; it outputs pseudo-JavaScript. `xhr1` and `xhr2` are byte arrays that contain the data later sent to Google's servers.

# Gathered information

Google's servers receive and process at least the following information:

* Plug-ins
* User-agent
* Screen resolution
* Execution time, timezone
* Number of click/keyboard/touch actions *in the `<iframe>` of the captcha*
* The behavior of many browser-specific functions and CSS rules
* The rendering of canvas elements
* Likely cookies, server-side (it's executed on the www.google.com domain)
* And likely other stuff...

You can look at the decompiled bytecode for more detail.

This information, along with numeric values hardcoded in the bytecode (forcing a potential bot to read all of it), is sent to the `https://www.google.com/recaptcha/api2/frame` page. Look at the `M.prototype.Q` function to see how the encoding is done. Some of the information (the part I call `xhr2` in the decompiler, which is retrieved from the `this.c[this.g]` variable; `xhr1` is in `this.c[this.d]`) is also encrypted with XTEA.

# What next...

We could:

* Gather statistics about when the checkbox captcha suffices and when it doesn't.
* Programmatically bypass the captcha by interpreting the bytecode.
* Programmatically bypass the captcha by simply running a rendering engine and automating mouse movements. But that would be slightly less fun.

Cheers and good reversing!