splotnikv · vshampor · Apr 20, 2018 · May 15, 2018 · May 15, 2018 · Jul 2, 2018
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "vshampor/deshuffler/unit_tests/googletest"]
+	path = vshampor/deshuffler/unit_tests/googletest
+	url = https://github.com/google/googletest.git
diff --git a/vshampor/SAS.md b/vshampor/SAS.md
@@ -0,0 +1,57 @@
+# "deshuffler"
+## Software Architecture Specification
+##### Vasily Shamporov, Apr 2017
+
+### Overview
+The program is written in C++ (targeting the C++14 standard). The basic control flow is shown in the figure below:
+
+![alt text](control_flow.png)
+
+The input YUV file, in which every frame except the first is shuffled in random order in units of 64x64 blocks, is first opened for reading. Next, for each frame, the data that describes the correct position of each shuffled tile on the unshuffled frame ("permutation data") is calculated. Afterwards, optionally, the original unshuffled stream is fully reconstructed from the shuffled input stream and the permutation data calculated in the previous step, and written to disk. The calculation of permutation data is based on motion estimation between consecutive frames of the input YUV stream. More details on some of the steps of the algorithm follow.
+
+### Details
+##### Calculate permutation for the stream
+![alt text](perm_gen.png)
+This step uses frame-level parallelism to improve performance: the input stream is divided into M equal batches of consecutive frames, and each batch is assigned a worker thread. Each worker thread then calculates permutation data between pairs of consecutive frames inside its batch, starting from the first pair in display order.
+
+The batch containing the first, unshuffled frame and the corresponding worker thread (hereafter the "primary" thread) are of special interest. Non-primary threads calculate permutation data between pairs of shuffled frames, whereas the primary thread can always calculate permutations between a shuffled frame and a reconstructed preceding frame, since its batch contains the first, unshuffled frame. Hence, the permutation data produced by non-primary threads will only be relative to the first frames of their respective batches, while permutation data produced by the primary thread will be absolute. An additional post-processing step is therefore required to produce absolute permutation data for the whole stream.
+
+It is assumed that motion estimation between a shuffled frame and an unshuffled one will be more effective in producing correct permutation data than motion estimation between two shuffled (albeit consecutive) frames, and that the calculation of permutation data for some of the frames in the non-primary thread batches may therefore fail (see below for more details on the failure status assignment). To address this, the failed frames from each non-primary thread are aggregated. After all threads have finished their calculations, the failed frames are processed in sequential order using reconstructed preceding frames (which should be available by then, either as video data or as absolute permutation data), and the correct permutation data is calculated for these frames.
+
+##### Calculate permutation for a sequential frame batch
+![alt text](perm_batch.png)
+As stated above, each worker thread processes its own batch of sequential frames, starting with the first pair of consecutive frames in display order. Permutation data between two frames is calculated using FEI PREENC, which performs motion estimation on a 16x16 block basis, while shuffled tiles have a size of 64x64 pixels. Theoretically, it is sufficient to perform motion estimation for a single 16x16 block inside each 64x64 tile to calculate the tile position on the preceding frame. This may be prone to errors, but brings an obvious performance gain; therefore, as a first step, for each pair of consecutive frames (K_(i - 1), K_i) a pair of special frames (S_(i - 1), S_i) is constructed by taking a 16x16 block from the center of each 64x64 tile and placing these blocks side by side in the same raster scan order as in the original frames. The permutation data is then calculated for frames (S_(i - 1), S_i). If this fails, the algorithm falls back to motion estimation on the full-resolution frames (K_(i - 1), K_i). If this fails as well (if, for example, it was not possible to reconstruct frame K_(i - 1)), then the whole frame K_i is assigned a failure status and processing moves on to the next pair of frames in the batch. It is assumed that the primary thread should not fail at this point; otherwise deshuffling as a whole fails, since the algorithm includes no other means of improving the motion estimation accuracy.
+
+##### Calculate permutation for a frame pair
+![alt text](perm_pair.png)
+When permutation data is calculated for two frames A and B, one of them serves as a reference for the other in terms of motion estimation. Let A be the reference frame. Depending on the situation, it may already have absolute permutation data (calculated previously by the primary thread), relative permutation data (calculated previously by a non-primary thread), or no permutation data at all (if motion estimation by a non-primary thread failed, or frame A is the first one in a batch belonging to a non-primary thread). If frame A has absolute permutation data, then frame B will be assigned absolute permutation data after the PREENC run as well, and it is marked as such. Otherwise, frame B is marked as having relative permutation data.
+
+Next, PREENC is run on frames A and B with A as reference. The output of PREENC is a map of (multiple) motion vectors for each 16x16 block of the frame, together with the corresponding distortion values. Afterwards, if frames A and B were down-sized using the algorithm described in the previous section, a single best motion vector is selected for each 16x16 block (representing a 64x64 tile on the full-resolution frame); otherwise, if frames A and B had full resolution, a single best motion vector is selected for each 64x64 tile. Either way, at this point a per-tile map of motion vectors is produced for frame B relative to frame A. If this map specifies a valid permutation of tiles (i.e. no two MVs point to the same tile on frame A), then the calculation is deemed successful and actual permutation data is computed and assigned to frame B; a success status is returned. Otherwise, the calculation is deemed a failure: no permutation data is computed and a failure status is returned.
+
+###### PREENC call specifics
+As stated above, PREENC works on a 16x16 block basis. However, the range of produced MVs is limited by the PREENC window size (roughly 128x96 pixels) - see picture below:
+
+![alt text](preenc_single.png)
+
+For our purposes the desired MVs (specifying the tile permutation) may be larger than the PREENC window size, as large as the frame width/height. To ensure that each 16x16 block is searched for across the whole frame, PREENC is called multiple times on the same pair of frames, each time with a different "offset vector map": a 2D array of vectors (x, y), one for each 16x16 block, each specifying the offset of the PREENC search window from the center of that 16x16 block.
+
+The number of PREENC calls is determined by the frame size and the PREENC window size. The principle is to break the frame into an integer number of equal search areas, each having width and height equal to the PREENC window size; the number of PREENC calls is equal to the number of search areas. At this point the frame size is aligned to 16 pixels, but not to the search area size, so the search areas will overlap, as illustrated in the following picture, which has 12 search areas (red dots correspond to the centers of the search areas):
+
+ ![alt text](preenc.png)
+
+For each PREENC call corresponding to one search area, the offset vector map is constructed as follows: for each 16x16 block on the frame, the offset vector is drawn from the center of the block to the center of the search area. This is illustrated in the picture below (only the offset vectors for the first 9 top-left blocks are shown):
+
+![alt text](preenc_map.png)
+
+The resulting motion vectors and distortion values from each call are aggregated per-16x16 block and passed higher up the architecture for purposes of finding the ultimate per-64x64 tile motion vector map.
+
+Since each PREENC call associated with a search area is independent of the others, these calls can be distributed among threads, achieving, roughly speaking, "search-area parallelism".
+
+###### Checking the per-tile MV map for consistency
+Determining whether the per-tile MV map specifies a valid permutation of tiles is performed in the following way: first, a 2-D array of M x N boolean values `bool hitmap[M][N]` is allocated (where M and N are the width and height of the frame in tile units, respectively) and each value is initialized to `false`. Next, the per-tile motion vectors are processed in tile raster scan order; for each one, the coordinates N_x, N_y (in tile units) of the "target" tile, i.e. the tile that the motion vector points to when centered on the tile it belongs to (the "source" tile), are calculated. If `hitmap[N_x][N_y]` is `false`, it is set to `true` to mark that the corresponding "target" tile has been associated with one of the "source" tiles. If `hitmap[N_x][N_y]` is already `true`, the MV map is deemed not to specify valid permutation data. Otherwise, if all per-tile MVs have been processed without encountering a `hitmap[N_x][N_y]` that is already `true`, the MV map is deemed to specify valid permutation data. The complexity of this algorithm is O(M * N) in both computation and memory.
+
+##### Permutation data
+The permutation data format for frame B relative to frame A is simple - it is a list of integers (one integer for each tile of frame B in raster scan order), each one representing a position of the corresponding tile on frame A in raster scan order.
+
+##### Reconstructing the original stream
+Since the absolute permutation data is known by the time the original stream reconstruction step is executed (i.e. each frame can be reconstructed using only its own pixel data and the permutation data), this step is easily parallelizable below the frame level: for example, a single thread may be assigned to each tile to be moved.
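Note on SAS.md: the consistency check and the reconstruction step described above can be sketched together as follows. This is only an illustration under stated assumptions, not code from this change: `IsValidPermutation`, `ReconstructLuma` and `kTile` are hypothetical names, only the 8-bit luma plane is handled, the frame dimensions are assumed to be multiples of 64, and the hitmap is kept as a flat vector indexed in raster scan order rather than as a 2-D array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant: tile edge in pixels, as stated in SAS.md.
constexpr std::size_t kTile = 64;

// perm[i] is the raster-scan position on frame A of tile i of frame B.
// Returns true only if every target index is in range and hit exactly once.
bool IsValidPermutation(const std::vector<std::size_t>& perm)
{
    std::vector<bool> hitmap(perm.size(), false);
    for (std::size_t target : perm)
    {
        if (target >= perm.size() || hitmap[target])
            return false;  // out of range, or two tiles claim the same target
        hitmap[target] = true;
    }
    return true;
}

// Moves each 64x64 luma tile of `shuffled` to the position given by `perm`.
// Assumes width and height are multiples of kTile and perm is absolute.
std::vector<std::uint8_t> ReconstructLuma(const std::vector<std::uint8_t>& shuffled,
                                          std::size_t width, std::size_t height,
                                          const std::vector<std::size_t>& perm)
{
    const std::size_t tiles_x = width / kTile;
    assert(shuffled.size() == width * height);
    assert(perm.size() == tiles_x * (height / kTile));

    std::vector<std::uint8_t> out(shuffled.size());
    for (std::size_t src = 0; src < perm.size(); ++src)
    {
        const std::size_t dst = perm[src];
        const std::size_t sx = (src % tiles_x) * kTile;
        const std::size_t sy = (src / tiles_x) * kTile;
        const std::size_t dx = (dst % tiles_x) * kTile;
        const std::size_t dy = (dst / tiles_x) * kTile;
        for (std::size_t row = 0; row < kTile; ++row)
            for (std::size_t col = 0; col < kTile; ++col)
                out[(dy + row) * width + dx + col] =
                    shuffled[(sy + row) * width + sx + col];
    }
    return out;
}
```

Each iteration of the outer loop in `ReconstructLuma` writes a disjoint tile of the output, which is why the step can be spread across threads without synchronization.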
diff --git a/vshampor/control_flow.png b/vshampor/control_flow.png
diff --git a/vshampor/deshuffler/.gitignore b/vshampor/deshuffler/.gitignore
@@ -0,0 +1,20 @@
+/Debug/
+*.yuv
+CMakeFiles/*
+bin/*
+*/bin/*
+*/CMakeCache.txt
+CMakeCache.txt
+CMakeFiles
+CMakeScripts
+Testing
+Makefile
+cmake_install.cmake
+install_manifest.txt
+compile_commands.json
+CTestTestfile.cmake
+lib/*
+sample_common/lib/*
+*.pc
+/build/
+*.lib
diff --git a/vshampor/deshuffler/CMakeLists.txt b/vshampor/deshuffler/CMakeLists.txt
@@ -0,0 +1,37 @@
+cmake_minimum_required (VERSION 3.11)
+project (deshuffler)
+if (NOT CMAKE_BUILD_TYPE)
+    message(STATUS "No build type selected, default to Debug")
+    set(CMAKE_BUILD_TYPE "Debug")
+endif()
+
+set (CMAKE_CXX_STANDARD 11)
+
+set(CMAKE_BINARY_DIR bin)
+set(EXECUTABLE_OUTPUT_PATH ${CMAKE_BINARY_DIR})
+set(LIBRARY_OUTPUT_PATH lib)
+
+include_directories(include)
+include_directories(msdk_api/include)
+include_directories(sample_common/include)
+
+add_subdirectory(unit_tests)
+
+add_subdirectory(sample_common)
+
+add_library(deshuffler STATIC
+		src/deshuffler.cpp
+		src/input_params.cpp
+		src/permutation_data.cpp
+		src/yuv_reader_seek_i420.cpp)
+add_dependencies(deshuffler sample_common)
+target_link_libraries(deshuffler sample_common)
+
+add_executable(deshuffler_cl src/main.cpp)
+add_dependencies(deshuffler_cl deshuffler)
+target_link_libraries(deshuffler_cl deshuffler)
+
+add_custom_target(run_tests ALL
+    DEPENDS deshuffler
+    COMMAND unit_tests
+    WORKING_DIRECTORY unit_tests/bin/)
diff --git a/vshampor/deshuffler/include/deshuffler.h b/vshampor/deshuffler/include/deshuffler.h
@@ -0,0 +1,28 @@
+#ifndef DESHUFFLER_H_
+#define DESHUFFLER_H_
+
+#include "permutation_data.h"
+#include "input_params.h"
+#include "yuv_reader_seek_i420.h"
+#include "permut_calc_task.h"
+#include <sample_utils.h>
+#include <vector>
+
+class Deshuffler
+{
+public:
+    Deshuffler() = default;
+    Deshuffler(const InputParams& params) : m_params(params) {}
+    void CalculatePermutation();
+    void OutputPermutation();
+    void ReconstructStream();
+    void OutputStream();
+private:
+    std::vector<PermutCalcTask> GeneratePermutCalcTasks();
+    mfxStatus CalculatePermutCalcTask(PermutCalcTask& task);
+    InputParams m_params;
+    PermutationData m_permutation_data;
+    CSmplYUVWriter m_YUVWriter;
+};
+
+#endif /* DESHUFFLER_H_ */
diff --git a/vshampor/deshuffler/include/input_params.h b/vshampor/deshuffler/include/input_params.h
@@ -0,0 +1,26 @@
+
+#ifndef INPUT_PARAMS_H_
+#define INPUT_PARAMS_H_
+
+#include <mfxdefs.h>
+#include <string>
+
+struct StreamInfo
+{
+    std::string filename;
+    mfxU32 width;
+    mfxU32 height;
+    mfxU32 frame_count;
+};
+
+class InputParams
+{
+public:
+    InputParams() = default;
+    InputParams(int argc, char* argv[]);
+    mfxU32 thread_count = 8;
+
+};
+
+
+#endif /* INPUT_PARAMS_H_ */
diff --git a/vshampor/deshuffler/include/permutation_data.h b/vshampor/deshuffler/include/permutation_data.h
@@ -0,0 +1,10 @@
+#ifndef PERMUTATION_DATA_H_
+#define PERMUTATION_DATA_H_
+
+class PermutationData
+{
+public:
+    PermutationData();
+};
+
+#endif /* PERMUTATION_DATA_H_ */
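Note on permutation_data.h: the `PermutationData` class is still an empty shell in this change. Assuming the format given in SAS.md (one integer per tile in raster scan order, giving the tile's position on the reference frame), the post-processing step that turns relative permutation data into absolute data could look roughly like the sketch below. `ComposePermutation` is a hypothetical free function, not part of the class, and it assumes the reference frame's absolute data is already known.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reference_abs[j]: unshuffled position of tile j of the reference frame.
// relative[i]:      position on the reference frame of tile i of this frame.
// Result[i]:        unshuffled position of tile i of this frame.
std::vector<std::size_t> ComposePermutation(const std::vector<std::size_t>& reference_abs,
                                            const std::vector<std::size_t>& relative)
{
    assert(reference_abs.size() == relative.size());
    std::vector<std::size_t> absolute(relative.size());
    for (std::size_t i = 0; i < relative.size(); ++i)
        absolute[i] = reference_abs.at(relative[i]);  // at() throws on a bad index
    return absolute;
}
```

Applying this once to each frame of a non-primary batch, with the batch's first frame as the reference, would yield absolute data for the whole batch after that first frame has been resolved.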
diff --git a/vshampor/deshuffler/include/yuv_reader_seek_i420.h b/vshampor/deshuffler/include/yuv_reader_seek_i420.h
@@ -0,0 +1,18 @@
+#ifndef SRC_YUV_READER_SEEK_H_
+#define SRC_YUV_READER_SEEK_H_
+
+#include <sample_utils.h>
+#include <string>
+#include <input_params.h>
+
+class YUVReaderSeekI420: public CSmplYUVReader
+{
+public:
+    mfxStatus Init(const StreamInfo& stream_info);
+    void Seek(mfxU32 frame_number);
+protected:
+    mfxU32 m_width = 0;
+    mfxU32 m_height = 0;
+};
+
+#endif /* SRC_YUV_READER_SEEK_H_ */
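Note on yuv_reader_seek_i420.h: the implementation of `Seek` is not part of this header, so the following is only a sketch of the byte offset such a method would presumably need. It assumes 8-bit I420 frames stored back to back with no padding or headers; `I420FrameSize` and `I420FrameOffset` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of one 8-bit I420 frame: a full-resolution Y plane followed by
// U and V planes subsampled by 2 in each direction (rounded up for odd sizes).
std::uint64_t I420FrameSize(std::uint32_t width, std::uint32_t height)
{
    const std::uint64_t w = width;
    const std::uint64_t h = height;
    const std::uint64_t luma = w * h;
    const std::uint64_t chroma = ((w + 1) / 2) * ((h + 1) / 2);
    return luma + 2 * chroma;
}

// Byte offset of frame `frame_number` (0-based) from the start of the file.
std::uint64_t I420FrameOffset(std::uint32_t width, std::uint32_t height,
                              std::uint32_t frame_number)
{
    return I420FrameSize(width, height) * frame_number;
}
```

The arithmetic is done in 64 bits because the offset exceeds the 32-bit range after a few hundred 1080p frames.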
diff --git a/vshampor/deshuffler/msdk_api/include/mfxastructures.h b/vshampor/deshuffler/msdk_api/include/mfxastructures.h
@@ -0,0 +1,162 @@
+// Copyright (c) 2017 Intel Corporation
+// 
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+// 
+// The above copyright notice and this permission notice shall be included in all
+// copies or substantial portions of the Software.
+// 
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE.
+#ifndef __MFXASTRUCTURES_H__
+#define __MFXASTRUCTURES_H__
+#include "mfxcommon.h"
+
+#if !defined (__GNUC__)
+#pragma warning(disable: 4201)
+#endif
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif /* __cplusplus */
+
+/* CodecId */
+enum {
+    MFX_CODEC_AAC         =MFX_MAKEFOURCC('A','A','C',' '),
+    MFX_CODEC_MP3         =MFX_MAKEFOURCC('M','P','3',' ')
+};
+
+enum {
+    /* AAC Profiles & Levels */
+    MFX_PROFILE_AAC_LC          =2,
+    MFX_PROFILE_AAC_LTP         =4,
+    MFX_PROFILE_AAC_MAIN        =1,
+    MFX_PROFILE_AAC_SSR         =3,
+    MFX_PROFILE_AAC_HE          =5,
+    MFX_PROFILE_AAC_ALS         =0x20,
+    MFX_PROFILE_AAC_BSAC        =22,
+    MFX_PROFILE_AAC_PS          =29,
+
+    /*MPEG AUDIO*/
+    MFX_AUDIO_MPEG1_LAYER1      =0x00000110, 
+    MFX_AUDIO_MPEG1_LAYER2      =0x00000120,
+    MFX_AUDIO_MPEG1_LAYER3      =0x00000140,
+    MFX_AUDIO_MPEG2_LAYER1      =0x00000210,
+    MFX_AUDIO_MPEG2_LAYER2      =0x00000220,
+    MFX_AUDIO_MPEG2_LAYER3      =0x00000240
+};
+
+/*AAC HE decoder down sampling*/
+enum {
+    MFX_AUDIO_AAC_HE_DWNSMPL_OFF=0,
+    MFX_AUDIO_AAC_HE_DWNSMPL_ON= 1
+};
+
+/* AAC decoder support of PS */
+enum {
+    MFX_AUDIO_AAC_PS_DISABLE=   0,
+    MFX_AUDIO_AAC_PS_PARSER=    1,
+    MFX_AUDIO_AAC_PS_ENABLE_BL= 111,
+    MFX_AUDIO_AAC_PS_ENABLE_UR= 411
+};
+
+/*AAC decoder SBR support*/
+enum {
+    MFX_AUDIO_AAC_SBR_DISABLE =  0,
+    MFX_AUDIO_AAC_SBR_ENABLE=    1,
+    MFX_AUDIO_AAC_SBR_UNDEF=     2
+};
+
+/*AAC header type*/
+enum{
+    MFX_AUDIO_AAC_ADTS=            1,
+    MFX_AUDIO_AAC_ADIF=            2,
+    MFX_AUDIO_AAC_RAW=             3,
+};
+
+/*AAC encoder stereo mode*/
+enum 
+{
+    MFX_AUDIO_AAC_MONO=            0,
+    MFX_AUDIO_AAC_LR_STEREO=       1,
+    MFX_AUDIO_AAC_MS_STEREO=       2,
+    MFX_AUDIO_AAC_JOINT_STEREO=    3
+};
+
+typedef struct {
+    mfxU32                CodecId;
+    mfxU16                CodecProfile;
+    mfxU16                CodecLevel;
+
+    mfxU32  Bitrate;
+    mfxU32  SampleFrequency;
+    mfxU16  NumChannel;
+    mfxU16  BitPerSample;
+
+    mfxU16                reserved1[22]; 
+
+    union {    
+        struct {   /* AAC Decoding Options */
+            mfxU16       FlagPSSupportLev;
+            mfxU16       Layer;
+            mfxU16       AACHeaderDataSize;
+            mfxU8        AACHeaderData[64];
+        };
+        struct {   /* AAC Encoding Options */
+            mfxU16       OutputFormat;
+            mfxU16       StereoMode;
+            mfxU16       reserved2[61]; 
+        };
+    };
+} mfxAudioInfoMFX;
+
+typedef struct {
+    mfxU16  AsyncDepth;
+    mfxU16  Protected;
+    mfxU16  reserved[14]; 
+
+    mfxAudioInfoMFX   mfx;
+    mfxExtBuffer**    ExtParam;
+    mfxU16            NumExtParam;
+} mfxAudioParam;
+
+typedef struct {
+    mfxU32  SuggestedInputSize;
+    mfxU32  SuggestedOutputSize;
+    mfxU32  reserved[6];
+} mfxAudioAllocRequest;
+
+typedef struct {
+    mfxU64  TimeStamp; /* 1/90KHz */
+    mfxU16  Locked;
+    mfxU16  NumChannels;
+    mfxU32  SampleFrequency;
+    mfxU16  BitPerSample;
+    mfxU16  reserved1[7]; 
+
+    mfxU8*  Data;
+    mfxU32  reserved2;
+    mfxU32  DataLength;
+    mfxU32  MaxLength;
+
+    mfxU32  NumExtParam;
+    mfxExtBuffer **ExtParam;
+} mfxAudioFrame;
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+
+#endif
+
+