diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..f70e60a 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -3,12 +3,214 @@ CUDA Character Recognition
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yan Dong
+  - [LinkedIn](https://www.linkedin.com/in/yan-dong-572b1113b/)
+  - [personal website](coffeier.com)
+  - [github](https://github.com/coffeiersama)
+* Tested on: Windows 10, i7-8750 @ 2.22GHz (12CPUs) 16GB, GTX 1060 14202MB (OMEN 15-dc0xxx)
 
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+[Result](#result) - [Rules](#rules) - [Runtime Analysis](#analysis) - [Extra](#extra)
+
+
+
+## Rules
+
+##### I built a simple neural network. It has:
+
+- an input layer - the image data comes in here; it currently has 10201 neurons, matching the provided data.
+- one hidden layer - the number of neurons can be changed. We compute the intermediate values with the weights, apply a sigmoid function, and the result becomes the input of the next layer.
+- an output layer - the neuron count is 52, since there are 52 classes at present.
+
+##### Loss function:
+
+There are two loss functions. The first is MSE: take the difference between the prediction and the real label, square it, and multiply by 0.5.
+
+![](img/eq1.png)
+
+The second is the cross-entropy method; here is the equation, where yi and yi' are the prediction and the ground truth.
+
+![](img/eq2.png)
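+
+In code, the two per-sample losses amount to the following (a minimal CPU sketch for reference only - the project computes these terms in the `kernmse_loss` / `kerncross_entropy` CUDA kernels; `truth` and `pred` are hypothetical host arrays holding the one-hot label and the softmax output):
+
+```
+#include <cmath>
+
+float mse_loss(int n, const float* truth, const float* pred) {
+    float loss = 0.f;
+    for (int i = 0; i < n; i++) {
+        float d = truth[i] - pred[i];
+        loss += 0.5f * d * d;              // 0.5 * (y - y')^2
+    }
+    return loss;
+}
+
+float cross_entropy_loss(int n, const float* truth, const float* pred) {
+    float loss = 0.f;
+    for (int i = 0; i < n; i++) {
+        loss += -truth[i] * logf(pred[i]); // -y * log(y')
+    }
+    return loss;
+}
+```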
+
+##### Forward Pass
+
+(both passes are written out as a short code sketch after the back-propagation list below)
+
+- feed the data into the input layer
+- multiply the neuron values of this layer by the weights (fully connected)
+- get the hidden-layer values
+- sigmoid
+- multiply the neuron values of this layer by the weights (fully connected)
+- sigmoid
+- softmax
+- read off the output
+
+
+
+##### Back Propagation
+
+- calculate the error between the ground truth and the output
+- take the softmax derivative and pass it backward
+- take the sigmoid derivative
+- multiply by the weights, pass backward, and get the hidden-to-output gradient
+- take the sigmoid derivative
+- multiply by the weights and get the input-to-hidden gradient
+- update the weights
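+
+Putting the two lists together, one pass over a single image looks roughly like this (a serial sketch; the real implementation runs the multiplies through cuBLAS and the element-wise steps as CUDA kernels, and `matvec` / `softmax` here are hypothetical helpers):
+
+```
+// forward: x(1 x in) -> hidden(1 x hid) -> output(1 x out)
+void forward(const float* x, const float* W1, const float* W2,
+             float* hid, float* out, int nIn, int nHid, int nOut) {
+    matvec(x, W1, hid, nIn, nHid);                                       // hid = x * W1
+    for (int i = 0; i < nHid; i++) hid[i] = 1.f / (1.f + expf(-hid[i])); // sigmoid
+    matvec(hid, W2, out, nHid, nOut);                                    // out = hid * W2
+    for (int i = 0; i < nOut; i++) out[i] = 1.f / (1.f + expf(-out[i])); // sigmoid
+    softmax(out, nOut);                                                  // probabilities
+}
+
+// backward (shapes only):
+//   err_out = (softmax_out - target) * sig'(out)   // output delta
+//   dW2     = hid^T * err_out                      // hidden-to-output gradient
+//   err_hid = (W2 * err_out) * sig'(hid)           // propagate through the weights
+//   dW1     = x^T * err_hid                        // input-to-hidden gradient
+//   W1 -= lr * dW1;  W2 -= lr * dW2;               // weight update
+```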
+
+
+
+## Analysis
+
+![](img/epi2.png)
+
+As the graph shows, when we increase the hidden-layer size, the number of episodes needed sometimes goes down and sometimes goes up; there is no obvious, stable trend for my network.
+
+![](img/acc.png)
+
+After training for more iterations, the accuracy increases.
+
+## Result
+
+###### There are separate training and testing parts
+
+You set the loss threshold you want (for instance, 0.5) and the number of training iterations. Training then runs; if the final loss is below the threshold, we move on to the test part.
+
+###### There are two ways to test
+
+The first randomly picks a picture to recognize and repeats this several times; the other simply follows the order (forward or backward - here I use forward) to run 52 tests. I print each recognition result.
+
+###### I write the weights to files in the folder; see hid_out_w.txt and input_hid_w.txt
+
+```
+****************
+** Recognize TESTS **
+****************
+successfully build the network!
+epoch 0 loss : 5.83244
+epoch 1 loss : 4.90023
+epoch 2 loss : 4.26333
+epoch 3 loss : 3.7257
+epoch 4 loss : 3.25222
+epoch 5 loss : 2.82764
+epoch 6 loss : 2.44545
+epoch 7 loss : 2.10429
+epoch 8 loss : 1.80387
+epoch 9 loss : 1.54303
+epoch 10 loss : 1.31975
+epoch 11 loss : 1.13122
+epoch 12 loss : 0.973897
+epoch 13 loss : 0.843676
+epoch 14 loss : 0.73633
+epoch 15 loss : 0.647869
+epoch 16 loss : 0.574772
+epoch 17 loss : 0.514071
+epoch 18 loss : 0.463332
+epoch 19 loss : 0.420601
+epoch 20 loss : 0.384323
+epoch 21 loss : 0.353275
+epoch 22 loss : 0.326494
+epoch 23 loss : 0.30322
+epoch 24 loss : 0.282853
+epoch 25 loss : 0.264912
+epoch 26 loss : 0.249011
+epoch 27 loss : 0.23484
+epoch 28 loss : 0.222143
+epoch 29 loss : 0.210712
+epoch 30 loss : 0.200374
+epoch 31 loss : 0.190984
+epoch 32 loss : 0.182422
+epoch 33 loss : 0.174586
+epoch 34 loss : 0.167391
+epoch 35 loss : 0.160762
+epoch 36 loss : 0.154637
+epoch 37 loss : 0.148961
+epoch 38 loss : 0.143687
+epoch 39 loss : 0.138776
+pic: 1 pred: 1
+pic: 2 pred: 2
+pic: 3 pred: 3
+pic: 4 pred: 4
+pic: 5 pred: 5
+pic: 6 pred: 6
+pic: 7 pred: 7
+pic: 8 pred: 8
+pic: 9 pred: 9
+pic: 10 pred: 10
+pic: 11 pred: 11
+pic: 12 pred: 12
+pic: 13 pred: 13
+pic: 14 pred: 14
+pic: 15 pred: 15
+pic: 16 pred: 16
+pic: 17 pred: 17
+pic: 18 pred: 18
+pic: 19 pred: 19
+pic: 20 pred: 20
+pic: 21 pred: 21
+pic: 22 pred: 22
+pic: 23 pred: 23
+pic: 24 pred: 24
+pic: 25 pred: 25
+pic: 26 pred: 26
+pic: 27 pred: 27
+pic: 28 pred: 28
+pic: 29 pred: 29
+pic: 30 pred: 30
+pic: 31 pred: 31
+pic: 32 pred: 32
+pic: 33 pred: 33
+pic: 34 pred: 34
+pic: 35 pred: 35
+pic: 36 pred: 36
+pic: 37 pred: 37
+pic: 38 pred: 38
+pic: 39 pred: 39
+pic: 40 pred: 40
+pic: 41 pred: 41
+pic: 42 pred: 42
+pic: 43 pred: 43
+pic: 44 pred: 44
+pic: 45 pred: 45
+pic: 46 pred: 46
+pic: 47 pred: 47
+pic: 48 pred: 48
+pic: 49 pred: 49
+pic: 50 pred: 50
+pic: 51 pred: 51
+pic: 52 pred: 52
+test time: 52 ,correct probability: 100%
+```
+
+
+
+## Extra
+
+I use cuBLAS to do the forward-pass and back-propagation calculations as matrix multiplies.
+
+The input read from the txt data can be seen as a *1 x 10201* matrix,
+
+the input_hid_weight is *10201 x hid_layer_size*,
+
+the hidden-layer data is *1 x hid_layer_size*,
+
+the hid_output_weight is *hid_layer_size x output_layer_size*,
+
+and the output is *1 x output_layer_size*.
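+
+Since cuBLAS is column-major, the hidden-layer product above maps onto a single GEMM call; a sketch, assuming device pointers `d_x`, `d_W1`, and `d_hid` with the shapes listed and `H` as the hidden-layer size:
+
+```
+const float alpha = 1.f, beta = 0.f;
+// column-major GEMM: d_hid(1 x H) = d_x(1 x 10201) * d_W1(10201 x H)
+cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
+            1, H, 10201,
+            &alpha, d_x, 1, d_W1, 10201,
+            &beta, d_hid, 1);
+```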
\ No newline at end of file
diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
index 7446175..9e487b9 100644
--- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt
+++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt
@@ -7,5 +7,5 @@ set(SOURCE_FILES
 
 cuda_add_library(character_recognition
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_60
     )
diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu
index 5a3ed7f..a1bedf3 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.cu
+++ b/Project2-Character-Recognition/character_recognition/mlp.cu
@@ -1,27 +1,623 @@
+#include <fstream>
+#include <iostream>
 #include <cuda.h>
 #include <cuda_runtime.h>
+#include <cublas_v2.h>
+#include <curand.h>
 #include "common.h"
 #include "mlp.h"
 
 namespace CharacterRecognition {
-    using Common::PerformanceTimer;
-    PerformanceTimer& timer()
-    {
-        static PerformanceTimer timer;
-        return timer;
-    }
-
-    // TODO: __global__
-
-    /**
-     * Example of use case (follow how you did it in stream compaction)
-     */
-    /*void scan(int n, int *odata, const int *idata) {
-        timer().startGpuTimer();
-        // TODO
-        timer().endGpuTimer();
-    }
-    */
-
-    // TODO: implement required elements for MLP sections 1 and 2 here
+
+    using Common::PerformanceTimer;
+    PerformanceTimer& timer()
+    {
+        static PerformanceTimer timer;
+        return timer;
+    }
+
+    //TODO: implement required elements for MLP sections 1 and 2 here
+#define blockSize 128
+
+    //neuron counts
+    int input_count;  // number of input instances
+    int SIZE_INPUT;   // size of the input layer
+    int SIZE_HiD;     // size of the hidden layer
+    int SIZE_OUTPUT;  // number of output classes
+
+    //data pointers
+    float *dev_input;
+    float *dev_hid;
+    float *dev_output;
+    //weights
+    float *weights_inandhid;      // input->hidden weights
+    float *weights_hidandoutput;  // hidden->output weights
+    float *dev_real;
+    float *hid_sig;
+    float *out_sig;
+    float *out_soft;
+    //gradients
+    float *wgrad_i2h;
+    float *wgrad_h2o;
+
+    //learning rate
+    float lr;
+    //loss
+    float loss_threshold;
+
+    //function variable
+    float exp_sum;
+
+    bool flag = false;
+
+    //initialize an array to a constant value
+    __global__ void Init(int n, float *data, float value) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        data[index] = value;
+    }
+
+    //print a column-major row x col matrix
+    void printM(const float *M,
+        int row, int col) {
+        for (int i = 0; i < row; ++i) {
+            for (int j = 0; j < col; ++j) {
+                printf("%f ", M[j * row + i]);
+            }
+            printf("\n");
+        }
+        printf("\n");
+    }
+
+    void Write_Weights2File(std::string filename,
+        const float *M, int row, int col) {
+        //std::cout << "print weights" << std::endl;
+        std::ofstream file(filename);
+        if (file.is_open()) {
+            for (int i = 0; i < row; i++) {
+                for (int j = 0; j < col; j++) {
+                    //std::cout << M[j * row + i] << " ";
+                    file << M[j * row + i] << " ";
+                }
+                file << "\n";
+            }
+            file.close();
+        }
+    }
+
+    //C(m x n) = A(m x k) * B(k x n), all column-major (cuBLAS convention)
+    void Mul(const float *A, const float *B, float *C,
+        int m, int n, int k, cublasHandle_t &handle) {
+        const float alpha = 1.0f;
+        const float beta = 0.0f;
+        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
+            m, n, k, &alpha, A, m, B, k, &beta, C, m);
+    }
+
+    //next(1 x size_next) = layer(1 x size_this) * weights(size_this x size_next)
+    void GetLayerOutput(int size_this, int size_next,
+        const float *layer, const float *weights, float *next,
+        cublasHandle_t &handle) {
+        Mul(layer, weights, next, 1, size_next, size_this, handle);
+    }
+
+    /////////////////////////////sigmoid////////////////////////////////////
+    __global__ void kernSigmoid(int n, const float *idata, float *odata) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        odata[index] = 1.0 / (1.0 + expf(-idata[index]));
+    }
+
+    /////////////////////////////softmax////////////////////////////////////
+    //e^x
+    __global__ void kernExp(int n, const float *idata, float* odata) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        odata[index] = expf(idata[index]);
+    }
+
+    float Get_sum(int n, float *arr) {
+        float sum = 0.0;
+        for (int i = 0; i < n; i++) {
+            sum += arr[i];
+        }
+        return sum;
+    }
+
+    //normalize
+    __global__ void kernNormalize(int n, float arr_sum,
+        const float *arr_exp, float *odata) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        odata[index] = arr_exp[index] / arr_sum;
+    }
+
+    void softmax(int n, float* idata, float* odata) {
+        dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+        //exponentiate the whole input
+        float* arr_exp;
+        cudaMalloc((void**)&arr_exp, sizeof(float) * n);
+        checkCUDAError("cudaMalloc arr_exp failed!");
+        Init<<<fullBlocksPerGrid, blockSize>>>(n, arr_exp, 0.0);
+        kernExp<<<fullBlocksPerGrid, blockSize>>>(n, idata, arr_exp);
+
+        //sum the exponentials on the host
+        float* exp = new float[n];
+        cudaMemcpy(exp, arr_exp, sizeof(float) * n, cudaMemcpyDeviceToHost);
+        /*std::cout << "exp: " << std::endl;
+        printM(exp, n, 1);*/
+        exp_sum = Get_sum(n, exp);
+
+        //normalize
+        kernNormalize<<<fullBlocksPerGrid, blockSize>>>(n, exp_sum, arr_exp, odata);
+
+        cudaFree(arr_exp);
+        delete[] exp;
+    }
+
+    //finally get the judgement
+    //return max_index + 1 so labels run from 1 to n, not from 0 to n-1
+    int GetfinalJudge(int n, float* after_softmax) {
+        int max_index = -1;
+        float max_prob = 0.0;
+        for (int i = 0; i < n; i++) {
+            if (max_prob < after_softmax[i]) {
+                max_prob = after_softmax[i];
+                max_index = i;
+            }
+        }
+        return max_index + 1;
+    }
+
+    /////////////////////////////loss function////////////////////////////////////
+    __global__ void kernmse_loss(int n, const float* real,
+        const float* predict, float* each) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        each[index] = 0.5 * (real[index] - predict[index]) * (real[index] - predict[index]);
+    }
+
+    __global__ void kerncross_entropy(int n, const float* real,
+        const float* predict, float* each) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        each[index] = -1.0 * (real[index] * logf(predict[index]));
+    }
+
+    //loss for one instance
+    float compute_loss(const float* real, const float *pred, int ind) {
+        //std::cout << "hi! compute loss!" << std::endl;
+        float *each_cros;
+        cudaMalloc((void**)&each_cros, sizeof(float) * SIZE_OUTPUT);
+        checkCUDAError("cudaMalloc each_cros failed!");
+
+        float *real_each;
+        cudaMalloc((void**)&real_each, sizeof(float) * SIZE_OUTPUT);
+        checkCUDAError("cudaMalloc real_each failed!");
+        cudaMemcpy(real_each, real + (ind * SIZE_OUTPUT), sizeof(float) * SIZE_OUTPUT, cudaMemcpyDeviceToDevice);
+
+        dim3 fullBlocksPerGrid((SIZE_OUTPUT + blockSize - 1) / blockSize);
+        //kernmse_loss<<<fullBlocksPerGrid, blockSize>>>(SIZE_OUTPUT, real_each, pred, each_cros);
+        kerncross_entropy<<<fullBlocksPerGrid, blockSize>>>(SIZE_OUTPUT, real_each, pred, each_cros);
+
+        //sum the per-class terms on the host
+        float *each_cros_host = new float[SIZE_OUTPUT];
+        cudaMemcpy(each_cros_host, each_cros, sizeof(float) * SIZE_OUTPUT, cudaMemcpyDeviceToHost);
+        //std::cout << "each loss:" << std::endl;
+        //printM(each_cros_host, 1, SIZE_OUTPUT);
+
+        float loss = Get_sum(SIZE_OUTPUT, each_cros_host);
+
+        cudaFree(each_cros);
+        cudaFree(real_each);
+        delete[] each_cros_host;
+        return loss;
+    }
+
+    /////////////////////////////gradient////////////////////////////////////
+    //map uniform [0, 1) weights to [-1, 1)
+    __global__ void kernAdjW(int n, float *M) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        M[index] = 2.0 * M[index] - 1.0;
+    }
+
+    //C = A - B, e.g. kernSub(SIZE_OUTPUT, out_soft, dev_odata, error1)
+    __global__ void kernSub(int n, float* A, float* B, float* C) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        C[index] = A[index] - B[index];
+    }
+
+    __global__ void kernAdd(int n, float* A, float* B, float* C) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        C[index] = A[index] + B[index];
+    }
+
+    //sigfun holds the sigmoid outputs; scale odata by sig * (1 - sig)
+    __global__ void kernSig_partial_deriv(int n, float* sigfun, float* odata) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        odata[index] *= (sigfun[index] * (1.0 - sigfun[index]));
+    }
+
+    __global__ void kernMse_deri(int n, const float *input, float *real, float *odata) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        odata[index] *= (input[index] - real[index]);
+    }
+
+    //SGD step: w -= learning_rate * gradient
+    __global__ void kernUp_Wei(int n, float *wei, float* gradient, float learning_rate) {
+        int index = threadIdx.x + (blockIdx.x * blockDim.x);
+        if (index >= n) {
+            return;
+        }
+        wei[index] = wei[index] - learning_rate * gradient[index];
+    }
+
+    //run a single instance through the network
+    void forwardpass(int instance_index, float* idata,
+        cublasHandle_t &handle) {
+        float* data;
+        cudaMalloc((void**)&data, SIZE_INPUT * sizeof(float));
+        checkCUDAError("cudaMalloc data failed!");
+        cudaMemcpy(data, idata + (instance_index * SIZE_INPUT), sizeof(float) * SIZE_INPUT, cudaMemcpyDeviceToDevice);
+
+        //compute the hidden layer
+        GetLayerOutput(SIZE_INPUT,  //size of current layer
+            SIZE_HiD,               //size of next layer
+            data,                   //idata
+            weights_inandhid,       //weights
+            dev_hid,                //hidden
+            handle);
+
+        //compute the sigmoid of the hidden layer
+        dim3 fullBlocksPerGrid((SIZE_HiD + blockSize - 1) / blockSize);
+        kernSigmoid<<<fullBlocksPerGrid, blockSize>>>(SIZE_HiD, dev_hid, hid_sig);
+
+        //compute the output layer
+        GetLayerOutput(SIZE_HiD,    //size of current layer
+            SIZE_OUTPUT,            //size of next layer
+            hid_sig,                //hidden_sig
+            weights_hidandoutput,   //weights
+            dev_output,             //output layer
+            handle);
+
+        dim3 fullBlocksPerGrid2((SIZE_OUTPUT + blockSize - 1) / blockSize);
+        kernSigmoid<<<fullBlocksPerGrid2, blockSize>>>(SIZE_OUTPUT, dev_output, out_sig);
+
+        //compute the softmax of the output layer
+        softmax(SIZE_OUTPUT, dev_output, out_soft);
+        cudaFree(data);
+    }
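+
+    // Back propagation as implemented below, in matrix form (column-major shapes):
+    //   error1 (out x 1)      = (out_soft - target) .* sig'(out_sig)    -- output delta
+    //   wgrad_h2o (hid x out) = hid_sig (hid x 1) * error1 (1 x out)
+    //   error2 (hid x 1)      = (weights_hidandoutput * error1) .* sig'(hid_sig)
+    //   wgrad_i2h (in x hid)  = input (in x 1) * error2 (1 x hid)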
+
+    //Computes the weight gradients for the current pass.
+    void compute_grad(int instance_number,
+        const float* input, const float* real,
+        cublasHandle_t &handle) {
+
+        float* dev_idata;
+        float* dev_odata;
+        cudaMalloc((void**)&dev_idata, sizeof(float) * SIZE_INPUT);
+        checkCUDAError("cudaMalloc dev_idata failed!");
+        cudaMalloc((void**)&dev_odata, sizeof(float) * SIZE_OUTPUT);
+        checkCUDAError("cudaMalloc dev_odata failed!");
+
+        cudaMemcpy(dev_idata, input + (instance_number * SIZE_INPUT), sizeof(float) * SIZE_INPUT, cudaMemcpyDeviceToDevice);
+        cudaMemcpy(dev_odata, real + (instance_number * SIZE_OUTPUT), sizeof(float) * SIZE_OUTPUT, cudaMemcpyDeviceToDevice);
+
+        //gradient of the weights between hidden and output layer
+        float* error1;
+        cudaMalloc((void**)&error1, SIZE_OUTPUT * sizeof(float));
+        checkCUDAError("cudaMalloc error1 failed!");
+        float* error2; //between input and hidden layer
+        cudaMalloc((void**)&error2, SIZE_HiD * sizeof(float));
+        checkCUDAError("cudaMalloc error2 failed!");
+
+        //the error between prediction and ground truth:
+        //softmax and cross-entropy derivative
+        dim3 fullBlocksPerGrid((SIZE_OUTPUT + blockSize - 1) / blockSize);
+        kernSub<<<fullBlocksPerGrid, blockSize>>>(SIZE_OUTPUT, out_soft, dev_odata, error1);
+
+        //sigmoid derivative
+        kernSig_partial_deriv<<<fullBlocksPerGrid, blockSize>>>(SIZE_OUTPUT, out_sig, error1);
+
+        Mul(hid_sig,     //A (SIZE_HiD x 1)            m x k
+            error1,      //B (1 x SIZE_OUTPUT)         k x n
+            wgrad_h2o,   //C (SIZE_HiD x SIZE_OUTPUT)  m x n
+            SIZE_HiD,    //m
+            SIZE_OUTPUT, //n
+            1,           //k
+            handle);
+
+        Mul(weights_hidandoutput, //(SIZE_HiD x SIZE_OUTPUT)
+            error1,               //(SIZE_OUTPUT x 1)
+            error2,               //(SIZE_HiD x 1)
+            SIZE_HiD,    //m
+            1,           //n
+            SIZE_OUTPUT, //k
+            handle);
+
+        //sigmoid derivative
+        dim3 fullBlocksPerGrid2((SIZE_HiD + blockSize - 1) / blockSize);
+        kernSig_partial_deriv<<<fullBlocksPerGrid2, blockSize>>>(SIZE_HiD, hid_sig, error2);
+
+        Mul(dev_idata,  //(SIZE_INPUT x 1)
+            error2,     //(1 x SIZE_HiD)
+            wgrad_i2h,  //(SIZE_INPUT x SIZE_HiD)
+            SIZE_INPUT, //m
+            SIZE_HiD,   //n
+            1,          //k
+            handle);
+
+        /*//debug
+        float *t1 = new float[SIZE_HiD*SIZE_INPUT];
+        cudaMemcpy(t1, wgrad_i2h, sizeof(float) * SIZE_HiD*SIZE_INPUT, cudaMemcpyDeviceToHost);
+        printf("grad weight hid input: \n");
+        //printM(t1, SIZE_INPUT, SIZE_HiD);
+
+        float *t2 = new float[SIZE_HiD*SIZE_OUTPUT];
+        cudaMemcpy(t2, wgrad_h2o, sizeof(float) * SIZE_HiD*SIZE_OUTPUT, cudaMemcpyDeviceToHost);
+        printf("grad weight hid output: \n");
+        //printM(t2, SIZE_OUTPUT, SIZE_HiD);
+        */
+
+        cudaFree(error1);
+        cudaFree(error2);
+        cudaFree(dev_idata);
+        cudaFree(dev_odata);
+    }
+
+    //cuBLAS is column-major!
+    //ref: https://blog.csdn.net/zcy0xy/article/details/84555053#cuBLAS_12
+    void build_network(int data_count, int num_feature, int num_class,
+        int hid_size, float ler, float loss_thre) {
+
+        input_count = data_count;
+        SIZE_INPUT = num_feature;
+        SIZE_HiD = hid_size;
+        SIZE_OUTPUT = num_class;
+        lr = ler;
+        loss_threshold = loss_thre;
+
+        //allocate memory for the layer weights
+        cudaMalloc((void**)&weights_inandhid, (SIZE_INPUT * SIZE_HiD) * sizeof(float));
+        checkCUDAError("cudaMalloc weights_inandhid failed!");
+        cudaMalloc((void**)&weights_hidandoutput, (SIZE_HiD * SIZE_OUTPUT) * sizeof(float));
+        checkCUDAError("cudaMalloc weights_hidandoutput failed!");
+
+        //initialize the weights with random numbers
+        //ref: https://blog.csdn.net/wesley_2013/article/details/12175391
+        curandGenerator_t gen;
+        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); //choose the RNG algorithm
+        curandSetPseudoRandomGeneratorSeed(gen, rand());         //seed the generator
+        //set weights
+        curandGenerateUniform(gen, weights_inandhid, SIZE_INPUT * SIZE_HiD);
+        curandGenerateUniform(gen, weights_hidandoutput, SIZE_HiD * SIZE_OUTPUT);
+
+        //map from [0, 1) to [-1, 1)
+        dim3 fullBlocksPerGrid((SIZE_INPUT * SIZE_HiD + blockSize - 1) / blockSize);
+        kernAdjW<<<fullBlocksPerGrid, blockSize>>>(SIZE_INPUT * SIZE_HiD, weights_inandhid);
+        dim3 fullBlocksPerGrid2((SIZE_HiD * SIZE_OUTPUT + blockSize - 1) / blockSize);
+        kernAdjW<<<fullBlocksPerGrid2, blockSize>>>(SIZE_HiD * SIZE_OUTPUT, weights_hidandoutput);
+
+        /*//debug
+        float * wih = new float[SIZE_INPUT * SIZE_HiD];
+        float * who = new float[SIZE_OUTPUT * SIZE_HiD];
+        cudaMemcpy(wih, weights_inandhid, sizeof(float) * (SIZE_INPUT * SIZE_HiD), cudaMemcpyDeviceToHost);
+        cudaMemcpy(who, weights_hidandoutput, sizeof(float) * (SIZE_OUTPUT * SIZE_HiD), cudaMemcpyDeviceToHost);
+        */
+
+        //hidden and output layers on the device
+        cudaMalloc((void**)&dev_hid, SIZE_HiD * sizeof(float));
+        checkCUDAError("cudaMalloc dev_hid failed!");
+        cudaMalloc((void**)&dev_output, SIZE_OUTPUT * sizeof(float));
+        checkCUDAError("cudaMalloc dev_output failed!");
+
+        //device memory for the weight gradients
+        cudaMalloc((void**)&wgrad_i2h, (SIZE_INPUT * SIZE_HiD) * sizeof(float));
+        checkCUDAError("cudaMalloc wgrad_i2h failed!");
+        cudaMalloc((void**)&wgrad_h2o, (SIZE_HiD * SIZE_OUTPUT) * sizeof(float));
+        checkCUDAError("cudaMalloc wgrad_h2o failed!");
+
+        //memory for intermediate results
+        cudaMalloc((void**)&hid_sig, SIZE_HiD * sizeof(float));
+        checkCUDAError("cudaMalloc hid_sig failed!");
+        cudaMalloc((void**)&out_sig, SIZE_OUTPUT * sizeof(float));
+        checkCUDAError("cudaMalloc out_sig failed!");
+        cudaMalloc((void**)&out_soft, SIZE_OUTPUT * sizeof(float));
+        checkCUDAError("cudaMalloc out_soft failed!");
+        std::cout << "successfully build the network!" << std::endl;
+    }
+
+    void train(float* input, float* real, int train_time) {
+        float epi_loss = 0.0;
+
+        //cuBLAS handle
+        cublasHandle_t handle;
+        cublasCreate(&handle);
+
+        //allocate memory for the input and the ground truth
+        cudaMalloc((void**)&dev_input, (input_count * SIZE_INPUT) * sizeof(float));
+        checkCUDAError("cudaMalloc dev_input failed!");
+        cudaMalloc((void**)&dev_real, (input_count * SIZE_OUTPUT) * sizeof(float));
+        checkCUDAError("cudaMalloc dev_real failed!");
+        cudaMemcpy(dev_input, input, sizeof(float) * (input_count * SIZE_INPUT), cudaMemcpyHostToDevice);
+        cudaMemcpy(dev_real, real, sizeof(float) * (input_count * SIZE_OUTPUT), cudaMemcpyHostToDevice);
+
+        for (int i = 0; i < train_time; i++) {
+            //for each instance
+            epi_loss = 0.0;
+            for (int j = 0; j < input_count; j++) {
+                //forward pass
+                forwardpass(j, dev_input, handle);
+
+                ////////back propagation//////////////////
+                //get the loss
+                epi_loss += compute_loss(dev_real, out_soft, j);
+                //compute the gradients
+                compute_grad(j, dev_input, dev_real, handle);
+
+                //update the weights
+                dim3 fullBlocksPerGrid(((SIZE_INPUT * SIZE_HiD) + blockSize - 1) / blockSize);
+                kernUp_Wei<<<fullBlocksPerGrid, blockSize>>>(SIZE_INPUT * SIZE_HiD, weights_inandhid, wgrad_i2h, lr);
+                dim3 fullBlocksPerGrid2(((SIZE_HiD * SIZE_OUTPUT) + blockSize - 1) / blockSize);
+                kernUp_Wei<<<fullBlocksPerGrid2, blockSize>>>(SIZE_HiD * SIZE_OUTPUT, weights_hidandoutput, wgrad_h2o, lr);
+            }
+            std::cout << "epoch " << i << " loss : " << epi_loss / (1.0 * input_count) << std::endl;
+            if (epi_loss / (1.0 * input_count) < loss_threshold && !flag) {
+                flag = true;
+                break;
+            }
+        }
+        if (flag) {
+            float * wih = new float[SIZE_INPUT * SIZE_HiD];
+            float * who = new float[SIZE_OUTPUT * SIZE_HiD];
+            cudaMemcpy(wih, weights_inandhid, sizeof(float) * (SIZE_INPUT * SIZE_HiD), cudaMemcpyDeviceToHost);
+            cudaMemcpy(who, weights_hidandoutput, sizeof(float) * (SIZE_OUTPUT * SIZE_HiD), cudaMemcpyDeviceToHost);
+
+            Write_Weights2File("input_hid_w.txt", wih, input_count, SIZE_HiD);
+            Write_Weights2File("hid_out_w.txt", who, SIZE_HiD, SIZE_OUTPUT);
+
+            delete[] wih;
+            delete[] who;
+        }
+
+        //destroy the handle
+        cublasDestroy(handle);
+    }
+
+    //test
+    void test(float* test_input, int test_times, int type) {
+        int t = type == 1 ? test_times : input_count;
+        //if (test_times <= 0 || !flag) {
+        //    std::cout << "error!" << std::endl;
+        //    return;
+        //}
+
+        //device memory for the test data
+        float* test_in;
+        cudaMalloc((void**)&test_in, (input_count * SIZE_INPUT) * sizeof(float));
+        checkCUDAError("cudaMalloc test_in failed!");
+        cudaMemcpy(test_in, test_input, sizeof(float) * (input_count * SIZE_INPUT), cudaMemcpyHostToDevice);
+
+        //cuBLAS handle
+        cublasHandle_t handle;
+        cublasCreate(&handle);
+
+        int correct_time = 0;
+        float *re = new float[SIZE_OUTPUT];
+
+        if (type == 1) {
+            //test a given number of randomly chosen pictures
+            while (test_times) {
+                int ind = (10000 * (int)rand()) % input_count;
+                forwardpass(ind, test_in, handle);
+                std::cout << "pic: " << ind + 1 << " pred: ";
+                cudaMemcpy(re, out_soft, sizeof(float) * SIZE_OUTPUT, cudaMemcpyDeviceToHost);
+                int result = GetfinalJudge(SIZE_OUTPUT, re);
+                if (result == ind + 1) {
+                    correct_time++;
+                }
+                std::cout << " " << result << std::endl;
+                test_times--;
+            }
+        }
+        else if (type == 2) {
+            //test every picture once, in order
+            for (int i = 0; i < input_count; i++) {
+                forwardpass(i, test_in, handle);
+                std::cout << "pic: " << i + 1 << " pred: ";
+                cudaMemcpy(re, out_soft, sizeof(float) * SIZE_OUTPUT, cudaMemcpyDeviceToHost);
+                int result = GetfinalJudge(SIZE_OUTPUT, re);
+                if (result == i + 1) {
+                    correct_time++;
+                }
+                std::cout << " " << result << std::endl;
+            }
+        }
+        std::cout << "test time: " << t
+            << " ,correct probability: "
+            << ((1.0 * correct_time) / t) * 100.0 << "%" << std::endl;
+
+        delete[] re;
+        cublasDestroy(handle);
+        cudaFree(test_in);
+        cudaFree(dev_hid);
+        cudaFree(dev_input);
+        cudaFree(dev_output);
+        cudaFree(dev_real);
+        cudaFree(out_sig);
+        cudaFree(out_soft);
+        cudaFree(hid_sig);
+        cudaFree(weights_inandhid);
+        cudaFree(weights_hidandoutput);
+    }
 }
diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h
index 2096228..7eca5ed 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.h
+++ b/Project2-Character-Recognition/character_recognition/mlp.h
@@ -3,7 +3,14 @@
 #include "common.h"
 
 namespace CharacterRecognition {
-    Common::PerformanceTimer& timer();
+    Common::PerformanceTimer& timer();
 
-    // TODO: implement required elements for MLP sections 1 and 2 here
+    // TODO: implement required elements for MLP sections 1 and 2 here
+
+    void build_network(int data_count, int num_feature, int num_class,
+        int hid_size, float ler, float loss_thre);
+
+    void train(float* input, float* real, int train_time);
+
+    void test(float* test_input, int test_times, int type);
 }
diff --git a/Project2-Character-Recognition/hid_out_w.txt b/Project2-Character-Recognition/hid_out_w.txt
new file mode 100644
index 0000000..f415834
--- /dev/null
+++ b/Project2-Character-Recognition/hid_out_w.txt
@@ -0,0 +1,25 @@
+0.29451 -0.435097 -0.585486 2.11342 -0.392652 0.186835 0.162754 0.403842 -0.0806076 1.80997 -0.905296 -0.322723 -0.601053 1.18543 0.229689 0.583769 0.0670121 -0.325607 -0.347061 -0.290325 -0.551759 -0.585138 -0.586835 0.711462 0.0135424 0.0780356 -0.938877 -0.421816 -0.278796 -0.703108 -0.50931 -1.11288 -0.0481748 0.300262 -0.784069 0.753415 0.396709 -0.332285 -0.228926 -0.256273 -0.194732 -0.753241 0.0995787 1.58972 -0.990666 -0.0140221 -0.415825 1.6434 1.50603 1.97106 0.0276378 -0.639799
+-0.213165 0.421657 -0.138217 1.23259 0.461238 -0.205997 0.213159 0.303983 -0.965584 0.168417 -0.509231 1.18642 -0.437274 1.7671 -0.753888 -0.320403 -0.0979614 0.119216 0.739964 -0.669452 -0.961166 0.0619697 0.133472 -0.638928 -0.935986 -0.059078 -1.10847 -0.4074 -0.852919 0.372626 -1.00291 0.678094 -0.516678 0.570887 0.205791 1.11327 -1.08914 1.29768 1.17523
-0.723096 -0.0941954 0.124456 0.196102 0.0643309 0.328557 0.0404158 1.89996 -0.145085 0.231199 0.31506 0.438671 0.442274 +1.07913 0.0743581 -0.462595 -0.990382 1.15425 -0.229805 -1.21913 -0.342996 -1.26313 -0.285098 -1.31393 1.06496 0.191175 0.256261 -1.18209 1.06213 -0.475579 -0.394601 -0.899963 -0.479416 0.0677363 0.834295 1.63491 -0.747539 0.860293 -0.529691 -0.108032 -0.0836923 0.461671 0.56414 0.196239 -0.167353 -0.615122 0.148807 0.152861 -0.34132 0.056369 1.45723 -0.684265 0.0762661 0.792114 1.16189 1.45754 -0.188783 -0.650672 -0.217166 -1.15054 -0.0433124 0.322781 0.337733 0.539607 -0.161182 +-1.40076 1.12502 -0.969809 -0.373178 -0.803651 1.24721 -0.330142 -1.1367 0.186529 -0.770568 -0.423989 -1.08687 -0.468014 -1.17954 -0.161503 0.833769 0.798625 0.0489926 -1.07293 0.945241 1.2868 -0.340574 -0.845284 -0.390045 0.553393 -0.91396 0.994309 -0.765059 -0.226574 0.321396 -0.952827 -0.00477712 -0.967609 -0.368021 0.87912 0.616292 1.57929 -0.728144 1.4013 -0.22562 -0.248204 0.347561 0.0740322 -0.391817 -1.12601 1.44342 1.15853 0.933214 0.605356 -0.182307 -0.566102 0.716688 +-0.0350349 -0.928915 0.0175354 -0.980479 -0.753281 -0.584979 -0.235664 0.0521397 -0.859456 -0.0755728 -0.813116 -0.683933 -1.21224 0.0386294 -0.277261 0.632587 1.14674 0.948391 -0.270966 -0.890896 0.790637 0.971929 -0.585058 -0.170487 0.714866 -0.428455 0.885471 0.0207451 -0.312878 -0.552794 -0.126212 0.922482 -0.691675 0.708875 0.984582 1.12845 -1.26206 -0.149199 -0.593778 0.724934 0.90782 0.0349696 -0.127147 0.490107 0.380464 -0.484554 -0.0998256 0.475704 -0.415316 0.373625 0.322898 1.42113 +0.695211 -0.149149 0.965554 0.379566 0.264172 0.745678 0.579864 0.413336 -0.322225 0.588068 -0.0757539 -0.586247 -0.146529 -0.704881 0.566705 -1.30622 1.04584 -1.23437 1.0738 0.585838 0.501724 0.729652 -1.31366 -1.73743 -1.4755 -1.26356 0.787037 -1.09577 0.984398 -0.545426 -0.162021 -0.473739 0.953205 0.606477 -0.43861 -1.32034 0.185602 -0.144099 0.750759 -0.654322 0.326066 0.499789 1.44527 0.240112 0.715035 0.763042 0.183189 -1.50034 0.653686 0.0104582 -1.21378 0.965676 +-0.685996 -0.651483 -1.05557 0.407174 -1.22577 0.366098 -0.395385 -1.29259 -0.960387 0.341325 0.776629 -0.656284 -0.417534 -1.15328 -0.188162 1.13075 0.54502 1.40119 -0.827696 -0.591806 0.248512 0.505993 1.08721 1.09469 -1.28169 -0.327892 0.0182214 -0.976317 0.939887 -1.06712 1.37528 0.747421 0.63332 -1.15303 -0.450049 0.168935 -0.418876 -0.59708 0.401593 0.506597 0.553412 -0.0286503 0.259326 -1.05651 -0.868733 -0.996118 -0.520162 0.249165 -0.0194128 0.967665 -0.876096 -0.315031 +-0.884281 0.776825 -0.0686684 -0.10858 0.404821 -0.784125 0.011065 -1.08571 -0.678831 0.196565 1.17585 -0.366034 0.267562 0.106757 0.967818 -0.958142 0.533746 -0.799019 1.23816 0.738888 -0.400174 -0.161328 -0.536213 0.392799 -0.649473 0.416614 -0.264423 0.392731 -1.0784 0.738391 -0.858962 -0.0839211 -0.132845 -0.00979847 1.34935 1.63852 0.638695 -1.00605 1.29015 1.15563 -0.983463 -0.54543 -0.810463 -0.470475 -1.36093 1.33065 0.218207 -0.237392 -0.547637 0.293879 -0.110275 0.43099 +-0.924337 0.655007 -1.47579 0.478897 -0.272729 1.1146 0.759748 -1.34789 0.982697 -1.41719 -1.05318 1.05607 0.542403 -0.269016 -1.2289 0.618554 -0.00832141 0.2573 0.245668 0.380477 0.570197 0.385505 0.556844 0.0214327 -1.33099 0.337729 -1.60551 0.46773 0.550296 0.774753 0.0419883 -1.43382 0.95215 0.101809 -0.678695 -0.716643 0.543504 0.607451 0.858323 0.318042 -0.393903 -0.680624 -0.142666 -0.0367496 -0.91302 -1.42492 0.370195 -0.202598 0.535927 0.102123 0.87294 1.16098 +0.71109 -1.16267 1.16387 0.464426 
-0.101168 1.38333 0.988723 0.910633 1.09957 0.790745 1.08259 0.30878 0.191628 1.48429 -1.14159 0.607356 -0.601688 -0.457902 0.936049 -1.14503 -0.873696 -0.491195 -0.97481 -0.585913 -0.564812 -1.70864 -0.719954 1.38325 1.31945 1.23338 -1.33308 1.17779 -1.0746 0.267207 -0.896233 -0.314204 0.536163 -0.260908 -0.443571 -0.776849 -0.615946 -0.0875837 -0.626001 0.128988 -0.510283 0.538623 -0.140031 -0.487799 -1.206 1.26023 -0.836838 -0.174957 +-0.0159583 -0.554108 -1.00089 -0.24234 1.10866 -0.795081 1.53728 0.943334 -1.02905 0.309063 0.0268236 -0.511166 0.831202 0.140488 -0.454913 -0.231738 -0.415469 1.40654 -0.524821 -0.133363 1.22243 -1.47591 0.561386 -0.851701 0.302247 -0.969312 -0.270788 -0.984972 0.390613 -0.545012 0.739465 0.665869 1.12178 -0.957192 -0.85782 0.30503 0.390882 0.629166 1.23095 1.34746 0.259721 1.37151 0.245741 0.83234 1.42711 0.547817 -0.798499 -0.922349 -1.06595 -0.894178 1.26676 0.610563 +0.1675 -0.563675 1.29057 -0.289968 1.20455 -0.471947 -0.865962 -0.250471 1.5353 0.22896 -1.17181 -1.06629 1.11554 -0.0766227 0.111997 -0.776727 -0.962588 1.05882 0.00509949 0.583927 -0.568768 0.466551 -0.0580937 -0.797289 -0.382499 -0.690725 0.410141 -0.545434 -0.519888 -0.715794 0.225257 1.70767 -0.628473 0.756321 -0.0107868 -1.05874 -1.03816 -1.01549 0.594414 -0.22068 0.487819 -0.104925 -0.519148 0.853449 -0.165595 -1.03138 1.24996 1.01547 1.24847 -0.834457 1.16176 -0.504422 +0.716361 0.0700269 -0.570294 0.783163 -1.10209 0.264594 -1.1336 0.743759 0.832235 0.876864 0.21862 0.551778 -1.46908 -0.131688 -1.03312 0.164697 0.474699 -1.20382 0.299676 0.688255 0.689297 -0.178173 0.476928 0.976139 1.17639 0.0262772 0.792254 0.180292 -1.4375 -0.927538 -0.566877 -0.982735 1.1185 -1.16178 -1.10506 0.226828 0.934476 -1.24123 -0.705437 -0.899701 0.487377 -1.35017 1.0461 -0.382314 -0.864332 -0.558506 0.00838594 -0.991959 0.706753 0.665023 0.809266 -1.55715 +-1.40551 -0.417117 -0.013 0.077838 0.730688 0.673288 -1.24105 0.447386 -0.144648 -0.821339 0.157852 0.439744 0.369823 0.383937 -0.201754 -0.012288 0.780191 1.01423 0.873674 1.20377 -0.777819 0.427524 0.28398 0.925649 -0.310545 0.240699 0.876687 -1.26063 -0.325244 1.27274 0.19729 -1.19567 -1.18687 -0.85281 -0.053213 0.117924 -1.31194 0.343471 -1.18672 0.467246 1.7128 -0.983386 -1.13042 1.18184 -1.25551 -0.42306 -0.427298 0.749168 -0.28059 0.281642 -0.694903 -0.222824 +-0.598591 0.822093 -1.25032 -0.875711 0.520949 -0.793229 1.18758 -0.369461 -0.0127338 -0.560747 0.539263 0.257154 0.27888 -0.0825568 0.852039 0.232866 0.440438 -0.378332 0.656851 1.05967 -1.16586 -1.10597 -0.881673 0.853027 0.957047 0.291303 -0.918031 0.354686 -0.598139 0.0147794 1.40297 0.379543 -0.127202 1.04961 -1.4476 -1.06018 0.28668 0.895301 -0.71472 0.0478485 1.37645 -1.03011 0.427315 -0.193764 1.24949 -0.432289 -0.567705 0.482442 -0.623365 -1.02192 -0.802499 -0.353789 +-0.0491603 -1.82128 1.08498 0.679567 -0.973928 -1.05803 0.314734 0.906591 0.204206 -0.66395 0.307383 -1.7602 0.0795372 -1.25419 1.15236 0.280503 -0.247753 0.466512 0.68999 -0.0691435 -0.771696 -0.316796 0.800363 0.698589 0.538966 -0.016899 0.287845 0.348061 0.866683 -0.0251087 -0.0630856 -0.54655 0.19943 1.16332 -0.278769 0.406851 -0.420093 -0.206772 -0.151451 0.948157 -1.43865 0.573489 -1.17248 0.101 0.374222 0.195381 0.358069 -0.433896 0.498112 0.511261 -1.58584 -1.45827 +0.0231081 0.595089 1.11196 -0.189116 0.349486 -0.105623 -0.235119 -0.73164 0.470907 0.290571 0.854769 0.287085 0.59691 0.795134 0.239294 -1.64299 -0.0975536 0.0586043 -1.51426 -0.247496 0.378959 0.209136 1.08545 0.927518 0.109562 0.959405 
0.163349 -1.06916 -0.788158 -1.04875 0.920949 0.851285 -0.767247 1.01537 -0.400141 -1.1908 -0.117827 -1.34558 -0.863665 0.999258 -1.41712 1.21436 -1.32643 0.313795 0.396155 0.54129 0.7869 -0.916468 0.671449 -0.196267 -0.197521 -1.27238 +0.850957 0.30812 -0.0829649 -0.0483991 0.228223 -0.670504 -0.098042 -0.932607 0.43993 -0.381196 0.447945 0.726889 -0.217291 -0.242501 0.103011 0.932602 0.29992 -0.132239 -0.41593 0.113875 0.0487252 -0.467437 -0.325686 0.479561 0.0338719 -0.305358 0.511878 0.478447 0.833054 -0.360977 -0.460933 0.276825 0.231447 -1.71089 -0.497162 0.059896 0.130965 0.0664659 -0.310532 -0.502204 -1.07156 -0.601345 -0.175668 -0.414229 -0.506793 -0.476298 0.293695 0.516251 -1.0936 -0.927461 1.08087 0.805068 +1.07124 -0.741413 0.883863 -0.229893 1.06989 -0.188237 0.17842 0.238984 0.298448 -1.37671 -0.785058 1.30999 1.11309 -0.160048 -0.899954 -0.739763 -1.14898 0.229157 0.25889 -1.195 -0.422784 0.827332 -0.98184 -0.497282 0.751617 0.851052 0.697569 0.106153 -1.32471 -0.37182 0.177117 0.177245 -1.08397 1.00086 0.130366 -0.114817 -0.421387 0.218223 -0.821985 0.173342 1.35035 -1.33415 -0.705626 -0.0142975 0.581109 -0.378751 1.06961 0.0283819 0.995848 1.38182 0.0247649 -0.345845 +1.11307 0.80242 -1.34112 0.234482 -1.61926 -0.0979021 -1.21615 1.02459 -1.20557 0.665714 -0.383876 -0.00809963 1.42684 0.858919 -0.267721 0.251381 -1.10644 -0.540579 -0.790976 -0.285183 0.55201 -1.39357 -1.32714 -1.28009 0.465189 1.41355 0.848218 1.10843 0.602338 0.560005 1.08591 0.336866 1.20857 0.685375 1.37877 0.636274 0.888443 0.867822 -1.06764 -0.2483 -0.895826 -0.567381 -1.44541 -1.25243 0.275466 1.14229 -0.805663 0.350166 -1.14731 0.213879 -1.26018 0.10007 +-0.976967 0.0351076 -0.176011 0.238311 0.251076 0.816201 -1.08086 0.971586 -0.721313 1.32761 -0.648624 -0.349771 -0.459281 -0.282859 1.72185 0.345002 -0.231086 -0.714348 -0.449467 -0.201789 0.0164356 -0.734322 0.178983 0.192123 0.995547 -0.654331 -0.554646 -0.245755 -0.189461 0.738691 0.720443 0.542568 -0.389211 -0.262263 1.58777 -0.155924 1.16948 -0.730615 -0.44205 -0.862527 0.60484 0.469911 1.1857 1.39041 1.44647 -0.310174 0.373634 -0.0758999 1.28584 -0.419142 -0.441103 0.440551 +-0.755702 0.499029 -0.338068 -0.48455 -0.698731 -0.174335 0.426116 -0.316964 -0.294103 -0.481109 -0.560564 -0.5311 -0.602031 0.563242 -0.480201 -0.627319 0.0645848 -0.214602 -0.350051 -1.91533 0.57671 0.0178644 0.755524 -0.0599938 -0.501426 0.269376 -0.351683 0.673376 -0.375094 0.186845 -0.278852 -0.183904 0.0425069 -0.233436 0.812019 0.0702625 -0.965793 0.37504 -0.108949 -2.03536 -0.326581 0.719141 0.522266 -0.425531 0.591411 1.17916 0.0122099 0.130707 0.0825948 -0.346249 -0.167625 0.288622 +0.0895034 -0.487552 0.249662 -0.431912 -0.330131 -0.306152 0.886456 0.607595 0.610502 1.09064 1.01733 0.480097 0.217418 -0.312296 0.272422 -0.813556 -0.488501 -0.446102 0.645456 -0.44372 -0.686488 0.762202 -0.0949142 -1.00683 -0.430218 1.47472 -0.461202 -0.304624 -0.374726 0.88428 0.843938 -1.3205 0.0515976 -0.15242 0.714611 -0.38514 -0.591027 -0.976015 0.406904 -0.876874 0.0550551 -0.212715 0.438441 -0.0960418 0.153012 -0.346421 -0.448915 0.72321 -0.454293 -0.15757 -0.607012 -1.31695 +0.454334 0.377923 0.165884 0.706319 0.748977 0.744489 -0.0889526 -0.0245744 0.31914 -0.00298467 0.135193 -0.451374 -0.156367 -1.27671 1.20855 -0.627874 -0.420813 -0.405182 -0.0493464 1.3658 -1.46034 0.911746 -1.15491 -0.675364 0.167173 0.316687 -1.31124 1.18794 -0.642811 -1.33091 -1.21475 0.631375 -0.705716 -0.832662 -0.197445 0.746793 -0.980156 0.325972 -0.130212 1.16222 -0.719995 0.624846 0.398638 
0.845189 0.0753604 -1.38409 -1.31722 0.432948 -0.766921 -0.960425 1.27198 0.744931 + diff --git a/Project2-Character-Recognition/img/acc.png b/Project2-Character-Recognition/img/acc.png new file mode 100644 index 0000000..9831107 Binary files /dev/null and b/Project2-Character-Recognition/img/acc.png differ diff --git a/Project2-Character-Recognition/img/epi.png b/Project2-Character-Recognition/img/epi.png new file mode 100644 index 0000000..09faa12 Binary files /dev/null and b/Project2-Character-Recognition/img/epi.png differ diff --git a/Project2-Character-Recognition/img/epi2.png b/Project2-Character-Recognition/img/epi2.png new file mode 100644 index 0000000..09faa12 Binary files /dev/null and b/Project2-Character-Recognition/img/epi2.png differ diff --git a/Project2-Character-Recognition/img/eq1.png b/Project2-Character-Recognition/img/eq1.png new file mode 100644 index 0000000..b537c94 Binary files /dev/null and b/Project2-Character-Recognition/img/eq1.png differ diff --git a/Project2-Character-Recognition/img/eq2.png b/Project2-Character-Recognition/img/eq2.png new file mode 100644 index 0000000..2409477 Binary files /dev/null and b/Project2-Character-Recognition/img/eq2.png differ diff --git a/Project2-Character-Recognition/input_hid_w.txt b/Project2-Character-Recognition/input_hid_w.txt new file mode 100644 index 0000000..2709175 --- /dev/null +++ b/Project2-Character-Recognition/input_hid_w.txt @@ -0,0 +1,53 @@ +-0.5555 -0.52768 -0.0630369 0.973843 0.695499 0.0503175 0.659901 -0.453903 0.148849 -0.825282 0.761814 0.223129 -0.534589 -0.755595 0.0274895 -0.674924 -0.979328 -0.0724995 0.0624142 0.583496 0.756053 0.993953 -0.745123 0.248258 +0.671619 -0.916772 0.00467575 -0.00792134 0.615687 -0.29876 0.523726 0.964567 0.0762572 0.92174 0.606627 0.794047 0.540835 -0.845099 -0.474254 0.0679946 0.992761 -0.329864 -0.191108 -0.768588 0.781585 -0.903238 -0.0490134 -0.652115 +0.793939 -0.864402 0.673466 -0.453382 -0.106918 -0.0640095 -0.923453 0.440157 0.51514 -0.033702 -0.390384 0.157632 -0.9282 0.00163734 0.454568 0.442521 -0.0216578 -0.627653 0.89836 0.803623 -0.977185 0.652039 -0.169971 0.0482441 +0.122545 -0.616293 0.170112 -0.845009 0.880404 -0.96859 0.769148 -0.464954 0.386879 -0.359254 -0.89285 0.59805 -0.584641 -0.42141 -0.657194 -0.181137 0.695559 -0.414945 -0.983384 0.965578 0.400473 -0.137397 -0.540553 0.285491 +0.293954 -0.582963 0.109354 0.0620784 0.731926 0.967349 0.826106 0.60468 -0.994507 -0.922442 -0.178321 0.78011 -0.649838 -0.91854 -0.69214 0.0531853 -0.031935 -0.156015 -0.48426 -0.614013 0.872099 -0.264579 0.621982 0.922115 +-0.598051 -0.0823568 0.890652 0.0837533 0.196772 -0.102436 0.0130262 -0.200961 -0.0321433 0.593794 0.274011 -0.224082 0.688557 0.120398 0.0469794 -0.427774 0.0607226 -0.171424 -0.6866 -0.0336972 -0.484648 0.0438639 -0.814904 0.28909 +0.0355321 -0.101123 0.401669 0.530574 0.348205 0.661931 -0.833829 -0.646286 -0.0019151 -0.706252 0.20436 -0.49592 0.0368049 0.356519 -0.0827079 0.0702157 -0.35397 0.976748 0.752387 -0.33936 0.711099 -0.542174 0.601845 0.286795 +-0.978952 0.885496 -0.54828 -0.997901 -0.3147 -0.718796 0.0193797 -0.678103 0.471666 0.416684 0.801137 -0.938392 -0.487574 -0.604847 -0.388337 -0.398295 -0.87199 -0.516849 0.861687 -0.557997 0.548658 0.0198796 0.258155 0.453576 +-0.85626 0.680265 -0.707601 0.0478959 -0.511746 0.238608 0.145991 -0.994167 -0.81099 0.196579 -0.919748 0.893742 0.11938 0.404978 0.1711 0.956976 -0.744058 0.673223 0.871307 0.140369 -0.655791 0.130366 -0.461749 -0.358724 +0.97054 -0.476199 -0.580746 -0.45115 0.400503 
-0.530029 0.352454 0.786928 -0.164934 0.490672 0.734457 0.468873 -0.445057 0.131027 -0.156457 -0.827257 -0.186187 -0.824661 -0.301378 0.0264277 0.194392 -0.323447 -0.915515 0.066294 +0.925847 0.296991 -0.245848 0.359212 -0.603158 -0.702539 -0.131244 0.124782 -0.355855 -0.876804 0.797568 0.280146 0.330565 0.205896 0.487288 -0.682948 0.0534059 -0.687544 0.584013 0.194481 0.258519 0.902275 -0.928634 0.207863 +0.878576 0.566201 -0.668169 0.364112 -0.273378 0.667649 -0.0714133 0.280254 -0.451078 -0.933793 -0.20078 0.607438 -0.566217 -0.506394 0.449277 -0.952784 -0.340174 0.517707 0.426085 0.496671 0.0325271 -0.111466 -0.658498 0.790769 +0.147246 -0.954811 -0.571859 0.968867 0.2465 -0.48934 -0.238826 0.120279 0.0448561 -0.19004 -0.469529 0.587174 -0.81027 -0.433105 -0.907405 0.946632 0.947259 -0.977589 -0.224312 0.796268 0.117285 0.117392 0.977149 -0.154005 +-0.127308 -0.466006 0.498404 0.786716 0.501212 0.025398 -0.855806 -0.830642 -0.89997 0.514473 -0.957111 -0.50487 -0.100711 0.754689 0.866331 -0.740767 0.281626 0.713027 -0.396019 -0.978078 0.924909 0.172641 0.659924 0.988271 +-0.0480892 -0.262592 0.646185 -0.830066 -0.456369 -0.51869 -0.507873 0.161911 -0.936702 0.943161 -0.18004 0.594854 0.51482 0.379497 0.624696 -0.102404 0.537824 -0.873409 0.448597 -0.218106 -0.435702 -0.737984 0.662539 -0.0149791 +0.104005 -0.537407 -0.975345 0.174569 0.526431 0.827075 0.968143 -0.0774499 -0.295282 -0.210457 -0.492659 -0.804943 0.3318 -0.540617 -0.412969 0.53523 0.521279 0.284964 -0.985563 0.429424 0.156637 0.135755 0.376979 -0.790342 +0.27688 0.120075 0.234007 -0.823301 0.143369 -0.388757 -0.450554 -0.619722 -0.757277 -0.175796 0.767232 0.138283 0.780712 -0.649365 0.273313 0.849019 -0.269056 0.216815 -0.253813 -0.281443 -0.11902 -0.206838 -0.579878 -0.705264 +0.688889 0.830906 0.993566 -0.801192 0.305878 0.445794 0.0834321 -0.0982576 -0.00787741 0.0213885 0.972073 -0.911027 0.0271214 0.633368 0.640284 0.447972 0.336867 -0.873511 0.507212 0.45285 0.325379 -0.35051 -0.335395 0.562327 +0.772191 -0.086458 0.575296 0.463959 0.661673 -0.0755599 0.316437 0.754155 -0.428832 -0.175107 -0.20822 -0.740061 -0.270031 0.708173 0.0674257 -0.166898 -0.260211 -0.978411 0.520459 0.16997 -0.796961 -0.70043 0.0203489 0.799347 +0.677113 0.441159 -0.104798 0.202062 -0.918476 -0.732334 -0.879266 0.85747 0.814499 -0.864753 -0.0847909 0.207776 0.905664 0.941291 0.212432 -0.856759 0.15521 -0.413177 -0.494948 0.680509 0.69954 0.523226 0.224711 -0.771282 +-0.98538 0.722714 0.311592 -0.298767 -0.825541 0.181599 -0.305038 -0.586577 0.0091033 -0.519639 0.777173 -0.469777 0.301011 0.0612264 -0.945485 -0.57641 0.281388 0.262096 -0.375763 -0.856022 -0.568265 -0.931932 -0.299868 -0.642054 +-0.66169 0.211981 -0.888998 0.909869 0.167492 0.909612 -0.402065 0.907366 0.430119 0.658651 -0.722798 0.290739 0.664424 0.117855 0.229407 -0.209626 0.407759 -0.579313 0.150747 -0.434808 0.64946 0.749631 -0.410736 -0.994386 +-0.742158 -0.378806 0.811475 -0.100921 0.946816 0.997832 0.497991 -0.0548304 -0.827313 0.883909 0.248482 -0.549973 0.548699 0.590544 -0.300296 -0.290366 -0.903002 0.0580134 -0.870023 -0.406338 -0.665861 -0.634917 0.905519 -0.685613 +-0.121803 0.309614 -0.439202 0.628071 -0.0926778 -0.851379 0.305925 -0.943568 -0.27987 -0.208549 0.389786 -0.182272 -0.369135 -0.401559 0.543022 0.820157 -0.552891 0.70806 -0.505505 0.0363481 0.917294 0.180776 0.797243 -0.155519 +0.987591 -0.804621 -0.0451149 0.707513 -0.0852496 -0.403647 0.143769 0.217077 -0.807544 -0.526058 -0.825002 -0.287532 -0.0639296 -0.277821 -0.424433 0.704086 -0.455906 0.0836881 
-0.904561 -0.0130253 -0.824808 0.960451 -0.854613 0.877063 +0.410899 -0.489194 0.197309 -0.239281 0.252786 -0.436606 -0.693697 0.302158 0.345469 -0.121982 -0.985025 -0.469999 -0.159397 -0.958697 -0.503597 -0.0570772 -0.643512 0.124497 -0.636889 0.833751 0.601808 0.127274 -0.635305 0.116188 +-0.344765 -0.54068 0.729006 0.256254 0.236452 0.276822 -0.331982 -0.549467 0.480215 0.0429096 -0.761029 -0.69605 -0.806518 0.762865 -0.724619 0.733771 -0.824438 -0.421741 0.832296 -0.714854 -0.183035 -0.929906 -0.267105 0.765464 +-0.989181 0.34012 -0.570541 -0.472543 0.0433332 0.228234 0.958638 0.93003 -0.622679 0.585782 0.996397 -0.0381906 -0.296187 0.918238 -0.655685 -0.657119 -0.748519 0.806815 -0.356717 -0.358066 -0.758106 0.671639 -0.167651 -0.922895 +0.651236 0.114973 0.455217 0.2533 -0.207057 0.973506 0.190944 -0.537898 0.490617 0.356847 0.628667 0.809938 0.785344 -0.795463 -0.104889 -0.335007 -0.952431 -0.256157 0.391796 -0.859644 0.185413 0.249087 -0.161793 0.456374 +0.823237 0.0660907 0.670911 -0.0196943 0.953384 0.622643 -0.651758 -0.46649 0.601208 -0.203061 -0.198042 -0.370193 0.0637649 0.723206 0.362898 0.522055 0.961876 -0.874302 -0.554425 -0.720571 0.374732 -0.292425 0.982238 0.180286 +-0.67042 0.938033 0.825911 0.329615 0.907242 -0.477585 -0.187197 -0.635437 -0.401585 0.37307 0.902302 -0.75401 0.779322 -0.135702 -0.360799 0.541107 0.970163 -0.087038 0.61325 -0.489054 0.29052 0.118601 0.933494 0.879498 +0.947803 0.804047 -0.420756 -0.142121 0.371581 -0.324479 0.913367 -0.820375 -0.253509 0.575611 0.386751 0.546458 0.864873 0.572499 -0.458831 0.580633 0.551177 0.770051 -0.0184301 -0.223443 0.718767 -0.00961798 0.775308 0.708749 +-0.55879 -0.787786 -0.712572 0.685214 0.136325 -0.502881 -0.86881 0.239054 -0.900514 0.426322 -0.189277 -0.461075 -0.0850652 -0.729662 -0.427768 0.570583 0.789148 -0.773304 0.269972 -0.876668 0.428302 0.51795 0.404572 -0.545283 +-0.78576 -0.830897 -0.85279 -0.434623 0.377523 -0.917308 -0.053388 0.476263 -0.288325 0.486763 -0.886821 0.87186 -0.541315 -0.0512428 -0.965794 -0.625601 0.0429806 -0.555798 0.813266 0.790522 -0.381911 0.411129 -0.128471 0.136918 +-0.302129 0.214935 -0.153924 -0.837429 0.209801 -0.0938269 -0.879782 -0.286335 -0.442722 0.563151 0.116801 -0.241968 -0.484324 -0.238175 -0.407557 -0.332613 0.183202 0.944856 0.117631 0.0916106 0.0671428 0.220972 0.996019 0.0508119 +-0.485788 -0.169601 0.951256 -0.55033 0.808006 -0.039263 -0.63525 -0.790641 0.236432 0.432869 0.662646 0.0164456 -0.119231 -0.296113 -0.37043 0.300351 0.857671 -0.103119 0.613973 -0.915533 0.239586 -0.310138 0.208331 0.287848 +0.809375 -0.486575 -0.977637 0.207954 0.0486223 -0.960812 -0.533625 -0.0852435 0.961812 -0.246218 -0.367779 -0.321312 -0.473957 0.670971 -0.652866 -0.265764 -0.615993 -0.752535 0.834631 -0.535882 0.712523 0.661345 0.257741 -0.134337 +0.42038 -0.607873 -0.134162 -0.0545924 0.731644 -0.648882 0.884794 -0.536509 -0.655662 0.141923 0.613988 -0.946254 -0.622849 -0.71063 -0.417225 -0.881053 0.920436 -0.415262 0.487881 -0.471068 0.573575 -0.0222703 0.0378631 0.463969 +0.597617 -0.426952 0.27604 0.694134 -0.0222594 -0.54334 -0.184852 0.168388 0.0644971 -0.915204 0.294415 0.351135 -0.802236 -0.591315 -0.843938 -0.473297 -0.815802 -0.549953 -0.779155 0.585538 -0.724672 -0.00848234 0.528475 0.723863 +0.671246 -0.213055 -0.554059 -0.429014 0.150415 0.176389 0.907427 -0.922137 -0.793755 0.112513 0.366839 0.315655 -0.0460263 -0.489416 -0.233318 -0.0283238 -0.438808 -0.722203 -0.533916 0.87792 -0.372736 -0.0819163 -0.543831 0.127992 +0.805906 -0.216765 -0.365329 -0.344837 
-0.415764 -0.222698 0.612261 -0.586827 0.600607 -0.00678074 0.456364 0.70747 -0.591998 -0.388413 -0.912355 -0.962378 -0.945276 -0.900912 0.481427 -0.771683 -0.206429 0.945997 0.692605 0.586974 +-0.937592 0.658721 0.337391 0.211286 0.229456 -0.745827 0.765702 0.351002 0.403541 -0.674389 0.133345 0.382925 0.142445 0.659041 -0.367358 -0.305363 -0.350487 -0.391122 -0.064006 0.28299 -0.672911 0.449927 -0.536104 0.150244 +-0.705449 0.179201 0.160426 0.618516 -0.825119 0.160273 0.492719 0.626372 0.642252 0.164823 0.549518 -0.993585 -0.234676 0.32792 0.665873 0.370315 -0.499061 -0.747001 0.167461 -0.580055 -0.563815 -0.0298521 0.0686789 -0.992054 +-0.537645 0.972134 -0.925409 -0.321733 0.326599 0.639723 0.0726297 -0.0259796 -0.669437 -0.529171 0.587522 -0.470374 -0.809908 0.988399 -0.192096 -0.867874 0.702148 0.210057 0.531216 0.672936 -0.964391 -0.493888 0.283944 0.0349585 +-0.289575 -0.869923 -0.528285 -0.501663 0.249759 0.339078 0.59757 -0.0828879 -0.332396 -0.951164 0.909761 0.454542 -0.759561 0.187368 -0.357086 -0.385487 0.111955 0.634815 -0.824721 0.882836 -0.501703 -0.84331 0.40743 -0.37105 +-0.705518 -0.547189 -0.854123 0.593536 0.281589 -0.118777 -0.813579 -0.241564 -0.174456 -0.907265 0.643905 -0.634554 -0.947222 0.804095 0.617487 -0.957668 -0.521206 0.184549 0.703334 0.387218 -0.719965 0.0334226 0.304629 0.830347 +0.914403 0.109022 0.267851 0.609395 -0.171172 0.786141 -0.830408 0.174771 0.6504 0.818175 0.015236 -0.483101 0.142805 -0.971931 -0.328937 0.772187 0.177555 0.310042 -0.0572684 0.403904 -0.0910475 -0.961225 -0.732882 0.577534 +0.830431 0.399068 0.616561 -0.681859 -0.265086 0.126643 -0.987372 0.691217 0.556301 -0.776344 -0.928538 0.609496 -0.0425682 0.945654 0.499128 0.152159 0.0695773 -0.464052 0.183 0.600996 -0.122054 -0.978298 0.914596 -0.883019 +0.0545274 -0.608602 -0.704501 -0.479449 -0.501449 0.186015 0.585085 0.826786 0.835944 0.729328 0.63452 0.0806251 0.537923 0.345217 0.469575 -0.0819347 -0.506166 0.713567 -0.441562 -0.685674 -0.0966243 -0.104525 0.45513 -0.760667 +-0.40962 -0.0355434 -0.823678 -0.584909 -0.397404 0.708305 0.2351 0.0950936 0.424637 0.981689 -0.283576 -0.860711 0.664715 -0.980088 0.444003 0.837854 0.198099 -0.721795 0.876467 0.870037 0.571302 0.35162 0.414114 -0.666017 +-0.41262 -0.704546 -0.817469 0.332227 0.0148846 -0.32744 -0.392716 0.364203 -0.0443745 -0.576235 -0.0732942 0.997281 0.459054 0.119127 0.39287 -0.859538 0.444805 -0.71638 0.833636 0.334389 -0.0388563 -0.392498 -0.12598 -0.986657 +-0.489354 0.266518 -0.789123 0.155223 0.430465 -0.141941 -0.887008 -0.663229 0.609728 -0.33042 -0.0632234 -0.618584 -0.164212 -0.321316 0.954138 0.0237305 -0.196465 -0.704081 -0.229122 0.202069 0.748914 0.236695 0.884353 -0.708488 + diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp index 11dd534..39524ea 100644 --- a/Project2-Character-Recognition/src/main.cpp +++ b/Project2-Character-Recognition/src/main.cpp @@ -1,4 +1,4 @@ -/** +/** * @file main.cpp * @brief Stream compaction test program * @authors Kai Ninomiya @@ -10,143 +10,127 @@ #include #include #include "testing_helpers.hpp" +#include +#include +#include +#include +#include const int SIZE = 1 << 8; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two -int *a = new int[SIZE]; -int *b = new int[SIZE]; -int *c = new int[SIZE]; +//int *a = new int[SIZE]; +//int *b = new int[SIZE]; +//int *c = new int[SIZE]; + +//size of the three buffer +const int input_count = 2; +const int size_INPUT = 4; +const int size_HiD = 2; 
+const int size_OUTPUT = 2; +const int size_WEI1 = size_INPUT * size_HiD; +const int size_WEI2 = size_HiD * size_OUTPUT; + +//create three buffers +float *input_ = new float[input_count * size_INPUT]; +float *hidden_ = new float[size_HiD]; +float *output_ = new float[size_OUTPUT]; +float *real_ = new float[input_count * size_OUTPUT]; + +//weights buffer +float *w1= new float[size_WEI1]; +float *w2 = new float[size_WEI2]; + +///real number read +const int character_num = 52; +const int feature_num = 10201; +const int class_num = 52; -int main(int argc, char* argv[]) { - // Scan tests +int main(int argc, char* argv[]) { printf("\n"); printf("****************\n"); - printf("** SCAN TESTS **\n"); + printf("** Recognize TESTS **\n"); printf("****************\n"); - genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - // initialize b using StreamCompaction::CPU::scan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. - // At first all cases passed because b && c are all zeroes. - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - 
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. - zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + /////////////////simple test/////////////////////////// + //printDesc("input"); + //genArrayf(size_INPUT * input_count, input_, 9); // Leave a 0 at the end to test that edge case + zeroArrayf(size_INPUT * input_count, input_); + input_[0] = 6.0; + input_[1] = 1.0; + input_[2] = 2.0; + input_[3] = 5.0; + input_[4] = 4.0; + input_[5] = 0.0; + input_[6] = 3.0; + input_[7] = 0.0; + //printArrayf(size_INPUT * input_count, input_, true); + //printDesc("real"); + zeroArrayf(input_count * size_OUTPUT, real_); + real_[1] = 1.0; + real_[3] = 1.0; + //printArrayf(input_count * size_OUTPUT, real_, true); + + zeroArrayf(size_HiD, hidden_); + zeroArrayf(size_OUTPUT, output_); + //printDesc("initial hidden"); + //printArrayf(size_HiD, hidden_, true); + //printDesc("initial output"); + //printArrayf(size_OUTPUT, output_, true); + + //CharacterRecognition::build_network(2, 4, 2, 2, 0.1, 0.5); + //CharacterRecognition::train(input_, real_, 10); + + /////////////////test with image/////////////////////////// + 
//load the image data and build the one-hot labels
+	float *input = new float[character_num * feature_num];
+	float *real = new float[character_num * class_num]();
+
+	std::string filename;
+	std::string pre;
+	std::string path = "D:\\study\\2019fall\\cis565\\Project2-Number-Algorithms\\Project2-Character-Recognition\\data-set\\";
+	//std::string p;
+	//read character_num files
+	int real_ind = 0;
+	int input_ind = 0;
+	for (int i = 1; i <= character_num; i++) {
+		//build the full file name: 01info.txt .. 52info.txt
+		if (i >= 10) {
+			pre = std::to_string(i);
+		} else {
+			pre = std::to_string(0) + std::to_string(i);
+		}
+		filename = pre + "info.txt";
+		std::ifstream f(path + filename, std::ios::in);
+		std::string line;
+		if (!f) {//failed to open
+			std::cout << "error opening source file." << std::endl;
+			return 0;
+		}
+
+		//first line: the class index (one-hot encoded into real)
+		std::getline(f, line);
+		int cl = std::stoi(line);
+		//std::cout << "class: " << cl << std::endl;
+		real[real_ind * class_num + cl - 1] = 1.0;
+		//std::cout << "ind: " << real_ind * class_num + cl - 1 << std::endl;
+		real_ind++;
+
+		//second line: the feature count (10201); skip it
+		std::getline(f, line);
+
+		int index = 0;
+		while (index < feature_num) {
+			std::string x;
+			f >> x;
+			input[input_ind++] = std::stof(x);
+			index++;
+			//std::cout << "data: " << x << std::endl;
+		}
+		f.close();
+	}
+
+	CharacterRecognition::build_network(52, 10201, 52, 25, 0.1, 0.5);
+	CharacterRecognition::train(input, real, 40);
+	CharacterRecognition::test(input, 30, 2);
 }
diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md
index 0e38ddb..dcdf628 100644
--- a/Project2-Stream-Compaction/README.md
+++ b/Project2-Stream-Compaction/README.md
@@ -3,12 +3,179 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yan Dong
+  - [LinkedIn](https://www.linkedin.com/in/yan-dong-572b1113b/)
+  - [personal website](coffeier.com)
+  - [github](https://github.com/coffeiersama)
+* Tested on: Windows 10, i7-8750 @ 2.22GHz (12CPUs) 16GB, GTX 1060 14202MB (OMEN 15-dc0xxx)
 
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+[Result](#features) - [Runtime Analysis](#analysis) - [Extra](#extra)
+
+
+
+## Features
+
+##### CPU:
+
+- Scan
+
+- Stream Compaction
+
+##### GPU:
+
+- Naive Scan
+
+- Work-Efficient Scan
+
+- Work-Efficient Stream Compaction
+
+- Thrust Scan
+
+###### Extra:
+
+- Why is the CPU faster than the work-efficient GPU version?
+
+- Radix Sort algorithm
+
+
+
+```
+****************
+** SCAN TESTS **
+****************
+    [   4  39  13  14   6  31  41  35  29  20   8  20  16 ...
 19   0 ]
+SIZE: 512, NPOT:509
+==== cpu scan, power-of-two ====
+   elapsed time: 0.0009ms    (std::chrono Measured)
+==== cpu scan, non-power-of-two ====
+   elapsed time: 0.0009ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 0.044032ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 0.043008ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 0.083968ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.08192ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.091136ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.06512ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   3   1   1   0   2   0   2   0   0   2   0   3 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 0.0019ms    (std::chrono Measured)
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 0.0013ms    (std::chrono Measured)
+    passed
+==== cpu compact with scan ====
+   elapsed time: 0.0054ms    (std::chrono Measured)
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 0.146432ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 0.137216ms    (CUDA Measured)
+    passed
+==== radix sort ====
+    [  62  63  17   9   8   6  12  46  28   0  58  16  47 ...  62  41 ]
+    [   0   6   8   8   9  12  16  17  17  28  32  32  41 ...  62  63 ]
+```
+
+
+
+## Analysis
+
+![](img/blocksize.png)
+
+As we increase the block size, there is no obvious change in the time cost, so changing the block size alone cannot improve performance here; other operations, such as reading and writing global memory, consume most of the time.
+
+![](img/arraysize.png)
+
+As we increase the array size, most of the lines rise as well. For the CPU, the two lines are very close: power-of-two (pot) versus non-power-of-two (npot) input makes little visible difference. In this graph the CPU is also the fastest method overall.
+
+For the GPU, consider Thrust first. As you can see, the Thrust lines are the most erratic ones: I ran the same parameters several times and got different Thrust timings each run, which really confused me. In any case, its time cost is unstable.
+
+For the naive scan, the npot line is always below the pot one, although the difference is small; I think this is because we process fewer elements in the npot case. Finally, for the work-efficient method, the npot line is sometimes higher than the pot one. I think this is caused by the extra operations and memory accesses needed to pad an npot array up to the next power of two.
+
+
+
+![](img/beforeandafter.png)
+
+This graph shows the effect of the first extra-credit change: after removing the lazy threads, we get a consistent time saving!
+
+## Questions
+
+##### Guess at what might be happening inside the Thrust implementation
+
+The Thrust timings vibrate between runs, but in my testing its speed falls between the naive and the work-efficient methods. My guess is that Thrust allocates less memory and uses shared memory to hide some of the latency.
+
+##### Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
+
+I think the answer is yes, and it differs between implementations.
+
+For the CPU, computation is the main bottleneck, since the whole scan runs in one sequential for loop. That sequential dependency is exactly what the GPU versions try to remove.
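+
+For reference, this is the entire CPU scan (the `my_scan` helper from `cpu.cu` later in this diff); each output element depends on the previous one, so the additions cannot be parallelized as written:
+
+```
+// CPU exclusive scan: odata[i] needs odata[i - 1] first,
+// so the n - 1 additions run strictly one after another.
+odata[0] = 0;
+for (int i = 1; i < n; i++) {
+    odata[i] = odata[i - 1] + idata[i - 1];
+}
+```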
+
+For the GPU naive scan, the slides tell us that although the number of additions is O(n log2(n)), the step complexity is O(log2(n)), which beats the CPU's O(n) steps. But in my tests the CPU is still always faster than the GPU, so I think the problem is memory I/O: I do not use shared memory in this homework, so every pass pays the latency of global memory. Other costs, such as bank conflicts and SM resource contention, can also appear. The work-efficient version suffers from these as well and performs more memory operations than the naive one, which I think is why it ends up slower than naive.
+
+As for Thrust, I am not sure... maybe memory I/O as well?
+
+## Extra
+
+##### Why is My GPU Approach So Slow?
+
+When I first implemented the work-efficient approach, not every thread did useful work.
+
+Consider the up-sweep phase of the algorithm:
+
+when d=0, we need to touch the whole array,
+
+when d=1, we only need to touch half of the elements, and so on.
+
+To solve this, I changed the indexing so that only the active elements are assigned to GPU threads, making the index "jump" by the current stride.
+
+![](img/up.png)
+
+Here is a comparison from before and after the change:
+
+###### Before
+
+![](img/before.png)
+
+###### After
+
+![](img/after.png)
+
+
+
+##### Radix sort
+
+This follows the radix sort from the course slides.
+
+How to call:
+
+```
+genArray(Sort_Size, ra, 64);
+zeroArray(Sort_Size, raout);
+StreamCompaction::Radix::radix_sort(Sort_Size, 6, raout, ra);
+printDesc("radix sort");
+printArray(Sort_Size, ra, true);
+printArray(Sort_Size, raout, true);
+```
+
+Result:
+
+![](img/radix.png)
\ No newline at end of file
diff --git a/Project2-Stream-Compaction/img/after.png b/Project2-Stream-Compaction/img/after.png
new file mode 100644
index 0000000..d8b2c83
Binary files /dev/null and b/Project2-Stream-Compaction/img/after.png differ
diff --git a/Project2-Stream-Compaction/img/arraysize.png b/Project2-Stream-Compaction/img/arraysize.png
new file mode 100644
index 0000000..e59861b
Binary files /dev/null and b/Project2-Stream-Compaction/img/arraysize.png differ
diff --git a/Project2-Stream-Compaction/img/before.png b/Project2-Stream-Compaction/img/before.png
new file mode 100644
index 0000000..648035b
Binary files /dev/null and b/Project2-Stream-Compaction/img/before.png differ
diff --git a/Project2-Stream-Compaction/img/beforeandafter.png b/Project2-Stream-Compaction/img/beforeandafter.png
new file mode 100644
index 0000000..48c7e8e
Binary files /dev/null and b/Project2-Stream-Compaction/img/beforeandafter.png differ
diff --git a/Project2-Stream-Compaction/img/blocksize.png b/Project2-Stream-Compaction/img/blocksize.png
new file mode 100644
index 0000000..635397c
Binary files /dev/null and b/Project2-Stream-Compaction/img/blocksize.png differ
diff --git a/Project2-Stream-Compaction/img/radix.png b/Project2-Stream-Compaction/img/radix.png
new file mode 100644
index 0000000..bb08ab8
Binary files /dev/null and b/Project2-Stream-Compaction/img/radix.png differ
diff --git a/Project2-Stream-Compaction/img/scan.png b/Project2-Stream-Compaction/img/scan.png
new file mode 100644
index 0000000..92697b7
Binary files /dev/null and b/Project2-Stream-Compaction/img/scan.png differ
diff --git a/Project2-Stream-Compaction/img/up.png b/Project2-Stream-Compaction/img/up.png
new file mode 100644
index 0000000..140d14b
Binary files /dev/null and b/Project2-Stream-Compaction/img/up.png differ
diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp
index d016553..9659348 100644
--- a/Project2-Stream-Compaction/src/main.cpp
+++
b/Project2-Stream-Compaction/src/main.cpp @@ -11,13 +11,21 @@ #include #include #include +#include #include "testing_helpers.hpp" + +//0001-> 10000000 const int SIZE = 1 << 8; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two +const int Sort_Size = 20; +//three int array int *a = new int[SIZE]; int *b = new int[SIZE]; int *c = new int[SIZE]; +//test radix +int *ra = new int[Sort_Size]; +int *raout = new int[Sort_Size]; int main(int argc, char* argv[]) { // Scan tests @@ -27,22 +35,41 @@ int main(int argc, char* argv[]) { printf("** SCAN TESTS **\n"); printf("****************\n"); + //size-1 -> generate number, a->array pointer, 50->max value of each number genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; + a[SIZE - 1] = 0;//the last one is 0 printArray(SIZE, a, true); + //std::cout << "16: "<< ilog2ceil(16) << std::endl; + //std::cout << "13: "<< ilog2ceil(13) << std::endl; + //std::cout << "14: " << ilog2ceil(14) << std::endl; + //int aa = (5 >> 0) & 1; + //int aa2 = (5 >> 1) & 1; + //int aa3 = (5 >> 2) & 1; + //int aa4 = (5 >> 3) & 1; + //std::cout << "5<<0: " << aa << std::endl; + //std::cout << "5<<1: " << aa2 << std::endl; + //std::cout << "5<<2: "<< aa3 << std::endl; + //std::cout << "5<<3: " << aa4 << std::endl; + std::cout << "SIZE: " << SIZE << ", NPOT:" << NPOT << std::endl; // initialize b using StreamCompaction::CPU::scan you implement // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct. // At first all cases passed because b && c are all zeroes. - zeroArray(SIZE, b); + zeroArray(SIZE, b);//initial as zero printDesc("cpu scan, power-of-two"); + //scan StreamCompaction::CPU::scan(SIZE, b, a); + //print the execute time printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + //print array b printArray(SIZE, b, true); + //zero initial array c zeroArray(SIZE, c); printDesc("cpu scan, non-power-of-two"); + //NPOT = SIZE - 3 StreamCompaction::CPU::scan(NPOT, c, a); + //print execute time printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); printArray(NPOT, b, true); printCmpResult(NPOT, b, c); @@ -51,7 +78,7 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan @@ -64,35 +91,35 @@ int main(int argc, char* argv[]) { printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); 
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -147,8 +174,27 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); - system("pause"); // stop Win32 console from closing on exit + //test radix sort + //ra[0] = 4;//100 + //ra[1] = 7;//111 + //ra[2] = 2;//010 + //ra[3] = 6;//110 + //ra[4] = 3;//011 + //ra[5] = 5;//101 + //ra[6] = 1;//001 + //ra[7] = 0;//000 + //ra[8] = 7;//111 + //ra[9] = 1;//001 + genArray(Sort_Size, ra, 64); + zeroArray(Sort_Size, raout); + StreamCompaction::Radix::radix_sort(Sort_Size, 6, raout, ra); + printArray(Sort_Size, ra, true); + printArray(Sort_Size, raout, true); + + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; delete[] c; + delete[] ra; + delete[] raout; } diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt index cdbef77..f75b9dd 100644 --- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt +++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt @@ -9,9 +9,11 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix.h" + "radix.cu" ) cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_60 ) diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu index 2ed6d63..5a6fc01 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.cu +++ b/Project2-Stream-Compaction/stream_compaction/common.cu @@ -1,39 +1,52 @@ #include "common.h" void checkCUDAErrorFn(const char *msg, const char *file, int line) { - cudaError_t err = cudaGetLastError(); - if (cudaSuccess == err) { - return; - } + cudaError_t err = cudaGetLastError(); + if (cudaSuccess == err) { + return; + } - fprintf(stderr, "CUDA error"); - if (file) { - fprintf(stderr, " (%s:%d)", file, line); - } - fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); - exit(EXIT_FAILURE); + fprintf(stderr, "CUDA error"); + if (file) { + fprintf(stderr, " (%s:%d)", file, line); + } + fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err)); + exit(EXIT_FAILURE); } namespace StreamCompaction { - namespace Common { + namespace Common { + /** + * Maps an array to an array of 0s and 1s for stream compaction. Elements + * which map to 0 will be removed, and elements which map to 1 will be kept. + */ + __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { + // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n) { + if (idata[index] == 0) { + bools[index] = 0; + } else { + bools[index] = 1; + } + } + } - /** - * Maps an array to an array of 0s and 1s for stream compaction. 
Elements - * which map to 0 will be removed, and elements which map to 1 will be kept. - */ - __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO - } + /** + * Performs scatter on an array. That is, for each element in idata, + * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. + */ + __global__ void kernScatter(int n, int *odata, + const int *idata, const int *bools, const int *indices) { + // TODO + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < n) { + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } + } + } - /** - * Performs scatter on an array. That is, for each element in idata, - * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]]. - */ - __global__ void kernScatter(int n, int *odata, - const int *idata, const int *bools, const int *indices) { - // TODO - } - - } + } } diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu index a2d3e6c..1322b2d 100644 --- a/Project2-Stream-Compaction/stream_compaction/cpu.cu +++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu @@ -12,14 +12,24 @@ namespace StreamCompaction { return timer; } + void my_scan(int n, int *odata, const int *idata) { + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + } + /** * CPU scan (prefix sum). * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ + //compute an exclusive prefix sum void scan(int n, int *odata, const int *idata) { + //odata is b, idata is a timer().startCpuTimer(); // TODO + my_scan(n, odata, idata); timer().endCpuTimer(); } @@ -28,11 +38,19 @@ namespace StreamCompaction { * * @returns the number of elements remaining after compaction. */ + //remove 0 int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int index = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[index] = idata[i]; + index++; + } + } timer().endCpuTimer(); - return -1; + return index; } /** @@ -43,8 +61,29 @@ namespace StreamCompaction { int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + //map + int* map_data = new int[n]; + for (int i = 0; i < n; i++) { + if (idata[i] == 0) { + map_data[i] = 0; + } else { + map_data[i] = 1; + } + } + //scan + int* scan_out = new int[n]; + //the last index of each number is the non-zero index + my_scan(n, scan_out, map_data); + //scatter + int count = 0; + for (int i = 0; i < n; i++) { + if (map_data[i] == 1) { + odata[scan_out[i]] = idata[i]; + count++; + } + } timer().endCpuTimer(); - return -1; + return count; } } } diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu index 2db346e..92a831a 100644 --- a/Project2-Stream-Compaction/stream_compaction/efficient.cu +++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu @@ -1,40 +1,244 @@ -#include +#include #include #include "common.h" #include "efficient.h" namespace StreamCompaction { - namespace Efficient { - using StreamCompaction::Common::PerformanceTimer; - PerformanceTimer& timer() - { - static PerformanceTimer timer; - return timer; - } - - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
- */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-        }
-
-        /**
-         * Performs stream compaction on idata, storing the result into odata.
-         * All zeroes are discarded.
-         *
-         * @param n The number of elements in idata.
-         * @param odata The array into which to store elements.
-         * @param idata The array of elements to compact.
-         * @returns The number of elements remaining after compaction.
-         */
-        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
-            // TODO
-            timer().endGpuTimer();
-            return -1;
-        }
-    }
+    namespace Efficient {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        //up-sweep step d: every 2^(d+1)-th element accumulates its left subtree
+        __global__ void kernUpSweep(int n, int d, int *data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            int offset = (int)(powf(2.0, 1.0 * (d + 1))) - 1;//2^(d+1) - 1
+            int offset2 = (int)(powf(2.0, 1.0 * d)) - 1;//2^d - 1
+            if (index % (offset + 1) == 0) {
+                data[index + offset] += data[index + offset2];
+            }
+        }
+        //improved version: launch only the threads that do work and let each one compute its jumping index
+        __global__ void kernUpSweep2(int n, int time, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            int i = (index + 1) * time - 1;
+            int j = i - time / 2;
+            data[i] += data[j];
+        }
+
+        //down-sweep step d: swap-and-add down the tree
+        __global__ void kernDownSweep(int n, int d, int *data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            int offset = (int)(powf(2.0, 1.0 * (d + 1))) - 1;//2^(d+1) - 1
+            int offset2 = (int)(powf(2.0, 1.0 * d)) - 1;//2^d - 1
+            if (index % (offset + 1) == 0) {
+                int t = data[index + offset2];
+                data[index + offset2] = data[index + offset];
+                data[index + offset] += t;
+            }
+        }
+
+        //improved down-sweep, same indexing trick as kernUpSweep2
+        __global__ void kernDownSweep2(int n, int time, int* data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            int i = (index + 1) * time - 1;
+            int j = i - time / 2;
+            int t = data[j];
+            data[j] = data[i];
+            data[i] += t;
+        }
+
+
+        //is the number a power of 2?
+        //https://stackoverflow.com/questions/600293/how-to-check-if-a-number-is-a-power-of-2
+        bool IsPowerOfTwo(int number) {
+            if (number == 0)
+                return false;
+            for (int power = 1; power > 0; power = power << 1) {
+                // This loop shifts through the powers of 2; the value wraps to 0
+                // after the last shift (binary 1000...0000 -> 0000...0000),
+                // which ends the loop.
+                if (power == number)
+                    return true;
+                if (power > number)
+                    return false;
+            }
+            return false;
+        }
+
+        //set a single element (used to zero the root before the down-sweep)
+        __global__ void reviseElemnt(int index, int *data, int value) {
+            data[index] = value;
+        }
+
+        __global__ void Init(int n, int *data) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            data[index] = 0;
+        }
+
+
+        void my_scan(dim3 fullBlocksPerGrid, int pow2n, int d, int *dev_data) {
+            //up
+            for (int i = 0; i <= d; i++) {
+                kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(pow2n, i, dev_data);
+            }
+            reviseElemnt<<<1, 1>>>(pow2n - 1, dev_data, 0);
+            //down
+            for (int i = d; i >= 0; i--) {
+                kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2n, i, dev_data);
+            }
+        }
+
+        bool time = true;
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
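+         * Work-efficient (Blelloch) version: pad the input to a power of two, up-sweep, zero the last element, then down-sweep.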
+         */
+        void scan(int n, int *odata, const int *idata) {
+            int *dev_data;
+            bool is_pow2 = IsPowerOfTwo(n);
+            int d = ilog2ceil(n) - 1;//log2(n) - 1
+            int pow2n = n;
+            if (!is_pow2) {
+                //pad the array size up to the next power of two
+                pow2n = static_cast<int>(pow(2.0, 1.0 * (d + 1)));
+            }
+            dim3 fullBlocksPerGrid((pow2n + blockSize - 1) / blockSize);
+
+            //allocate device memory and zero it, so the padding contributes nothing
+            cudaMalloc((void**)&dev_data, pow2n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(pow2n, dev_data);
+            //copy the n valid elements to the device
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);//host to device
+            checkCUDAError("cudaMemcpy dev_data failed!");
+
+            //start time
+            if (time) {
+                timer().startGpuTimer();
+            }
+
+            //my_scan(fullBlocksPerGrid, pow2n, d, dev_data);
+            //up-sweep
+            //before the change, the grid covered all pow2n elements at every step;
+            //now it only covers the 2^(ilog2ceil(n)-i-1) elements that actually work at step i
+            for (int i = 0; i <= d; i++) {
+                int new_block = static_cast<int>(powf(2.0f, 1.0 * (ilog2ceil(n) - i - 1)));
+                dim3 new_blockpergrid((new_block + blockSize - 1) / blockSize);
+                //kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(pow2n, i, dev_data);
+                int off = powf(2.0, i + 1);
+                kernUpSweep2<<<new_blockpergrid, blockSize>>>(new_block, off, dev_data);
+            }
+            reviseElemnt<<<1, 1>>>(pow2n - 1, dev_data, 0);//zero the root
+            //down-sweep
+            for (int i = d; i >= 0; i--) {
+                int new_block = static_cast<int>(powf(2.0f, 1.0 * (ilog2ceil(n) - i - 1)));
+                dim3 new_blockpergrid((new_block + blockSize - 1) / blockSize);
+                int off = powf(2.0, i + 1);
+                //kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(pow2n, i, dev_data);
+                kernDownSweep2<<<new_blockpergrid, blockSize>>>(new_block, off, dev_data);
+            }
+
+            if (time) {
+                timer().endGpuTimer();
+            }
+            //end gpu time
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);//get the result
+            checkCUDAError("get odata failed!");
+
+            //free
+            cudaFree(dev_data);
+        }
+
+        /**
+         * Performs stream compaction on idata, storing the result into odata.
+         * All zeroes are discarded.
+         *
+         * @param n The number of elements in idata.
+         * @param odata The array into which to store elements.
+         * @param idata The array of elements to compact.
+         * @returns The number of elements remaining after compaction.
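+         * Pipeline: kernMapToBoolean -> work-efficient scan -> kernScatter.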
+         */
+        int compact(int n, int *odata, const int *idata) {
+            int *map_data;
+            int *scan_out;
+            int *dev_idata;
+            int *dev_odata;
+
+            bool is_pow2 = IsPowerOfTwo(n);
+            int d = ilog2ceil(n) - 1;//log2(n) - 1
+            int pow2n = n;
+            if (!is_pow2) {
+                pow2n = static_cast<int>(pow(2.0, 1.0 * (d + 1)));
+            }
+            dim3 fullBlocksPerGrid((pow2n + blockSize - 1) / blockSize);
+
+            //allocate device memory, zero-initialized
+            cudaMalloc((void**)&map_data, pow2n * sizeof(int));
+            checkCUDAError("cudaMalloc map_data failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(pow2n, map_data);
+            cudaMalloc((void**)&scan_out, pow2n * sizeof(int));
+            checkCUDAError("cudaMalloc scan_out failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(pow2n, scan_out);
+            cudaMalloc((void**)&dev_idata, pow2n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(pow2n, dev_idata);
+            cudaMalloc((void**)&dev_odata, pow2n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(pow2n, dev_odata);
+
+            //copy only the n valid elements; the zeroed padding is discarded by the compaction anyway
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);//host to device
+            checkCUDAError("cudaMemcpy dev_idata failed!");
+            //cudaMemcpy(dev_odata, odata, sizeof(int) * pow2n, cudaMemcpyHostToDevice);//host to device
+            //checkCUDAError("cudaMemcpy dev_odata failed!");
+
+            timer().startGpuTimer();
+            //map to booleans
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(pow2n, map_data, dev_idata);
+            //scan
+            time = false;
+            scan(pow2n, scan_out, map_data);
+            //scatter: here the size is n, not pow2n
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata,
+                dev_idata, map_data, scan_out);
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);//get the result
+            checkCUDAError("get odata failed!");
+
+            //the compacted values are packed at the front, so the first zero
+            //marks the end; if there is no zero, all n elements survived
+            int count = n;
+            for (int i = 0; i < n; i++) {
+                if (odata[i] == 0) {
+                    count = i;
+                    break;
+                }
+            }
+
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
+            cudaFree(map_data);
+            cudaFree(scan_out);
+
+            return count;
+        }
+    }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..ec454be 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -6,20 +6,89 @@ namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
-        PerformanceTimer& timer()
-        {
+        PerformanceTimer& timer() {
             static PerformanceTimer timer;
             return timer;
         }
+        //initial
+        /*__global__ void kernInitialArray(int N, int * arr) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;//thread index
+            if (index >= N) {
+                return;
+            }
+            arr[index] = -1;
+        }*/
+
+        //one Hillis-Steele pass: add in the element 2^(d-1) slots to the left
+        __global__ void kernNaiveScan(int n, int d, int *odata, int* idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            int start = powf(2.0, 1.0 * (d - 1));//2^(d-1)
+            if (index >= start) {
+                odata[index] = idata[index - start] + idata[index];
+            } else {
+                odata[index] = idata[index];
+            }
+        }
+
+        //shift right by one to turn the inclusive result into an exclusive one
+        __global__ void kernInclu2Exclu(int n, int *inclu, int* exclu) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (index == 0) {
+                exclu[index] = 0;
+            } else {
+                exclu[index] = inclu[index - 1];
+            }
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
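+         * Naive (Hillis-Steele) version: ilog2ceil(n) full passes over the array, ping-ponging between two buffers, then shifted right to become exclusive.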
         */
-        void scan(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
+        void scan(int n, int *odata, const int *idata) {
+            int *dev_idata;
+            int *dev_odata;
+
+            //allocate device memory
+            cudaMalloc((void**)&dev_idata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_idata failed!");
+            cudaMalloc((void**)&dev_odata, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_odata failed!");
+
+            //copy the input to the device
+            cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);//host to device
+            checkCUDAError("cudaMemcpy dev_idata failed!");
+            cudaMemcpy(dev_odata, odata, sizeof(int) * n, cudaMemcpyHostToDevice);//host to device
+            checkCUDAError("cudaMemcpy dev_odata failed!");
+
+            int d = ilog2ceil(n);//number of passes
+            int* temp;
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+            //start time
+            timer().startGpuTimer();
+            for (int i = 1; i <= d; i++) {
+                kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, i, dev_odata, dev_idata);
+                //ping-pong the two buffers
+                temp = dev_idata;
+                dev_idata = dev_odata;
+                dev_odata = temp;
+            }
+            //from inclusive to exclusive
+            kernInclu2Exclu<<<fullBlocksPerGrid, blockSize>>>(n, dev_idata, dev_odata);
             timer().endGpuTimer();
+            //end gpu time
+
+            cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);//get the result
+            checkCUDAError("get odata failed!");
+
+            //free
+            cudaFree(dev_idata);
+            cudaFree(dev_odata);
         }
     }
 }
diff --git a/Project2-Stream-Compaction/stream_compaction/radix.cu b/Project2-Stream-Compaction/stream_compaction/radix.cu
new file mode 100644
index 0000000..f2a72d4
--- /dev/null
+++ b/Project2-Stream-Compaction/stream_compaction/radix.cu
@@ -0,0 +1,145 @@
+#include
+#include
+#include
+#include "common.h"
+#include "radix.h"
+
+namespace StreamCompaction {
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer() {
+            static PerformanceTimer timer;
+            return timer;
+        }
+
+        __global__ void Init(int n, int *data, int value) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            data[index] = value;//fill with the requested value (was hard-coded to 0)
+        }
+
+        //generate the b array (bit set) and the e array (bit clear) for the current bit
+        __global__ void kernGetBandEArray(int n, int bit, const int* idata, int *b_arr, int* e_arr) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            if ((idata[index] >> bit) & 1) {
+                b_arr[index] = 1;
+                e_arr[index] = 0;
+            } else {
+                b_arr[index] = 0;
+                e_arr[index] = 1;
+            }
+        }
+
+        //t[i] = i - f[i] + totalFalses
+        __global__ void kernGetTArray(int n, const int *f_arr, int total, int* t_arr) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            t_arr[index] = index - f_arr[index] + total;
+        }
+
+        __global__ void kernGetDArray(int n, const int *b_arr, const int *t_arr, const int *f_arr, int* d_arr) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            //if b is 1 -> t, else -> f
+            d_arr[index] = b_arr[index] ? t_arr[index] : f_arr[index];
+        }
+
+        //scatter into a separate output buffer; scattering in place would race,
+        //because another thread may not have read its element yet
+        __global__ void kernRearrange(int n, const int *d_arr, const int *idata, int *odata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            odata[d_arr[index]] = idata[index];
+        }
+
+        /**
+         * Performs an LSD radix sort on idata over the lowest bits_num bits, storing the sorted result into odata.
+         */
+        void radix_sort(int n, int bits_num, int *odata, const int *idata) {
+            int* dev_data;
+            int* dev_data2;
+            int* b_array;
+            int* e_array;
+            int* f_array;
+            int *t_array;
+            int *d_array;
+
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+            //allocate memory; initialize b, e, t and d with -1 (a debug sentinel) and f with 0 (for the kernel)
+            cudaMalloc((void**)&dev_data, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data failed!");
+            cudaMalloc((void**)&dev_data2, n * sizeof(int));
+            checkCUDAError("cudaMalloc dev_data2 failed!");
+            cudaMalloc((void**)&b_array, n * sizeof(int));
+            checkCUDAError("cudaMalloc b_array failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(n, b_array, -1);
+            cudaMalloc((void**)&e_array, n * sizeof(int));
+            checkCUDAError("cudaMalloc e_array failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(n, e_array, -1);
+            cudaMalloc((void**)&f_array, n * sizeof(int));
+            checkCUDAError("cudaMalloc f_array failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(n, f_array, 0);
+            cudaMalloc((void**)&t_array, n * sizeof(int));
+            checkCUDAError("cudaMalloc t_array failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(n, t_array, -1);
+            cudaMalloc((void**)&d_array, n * sizeof(int));
+            checkCUDAError("cudaMalloc d_array failed!");
+            Init<<<fullBlocksPerGrid, blockSize>>>(n, d_array, -1);
+            cudaDeviceSynchronize();
+
+            //copy the input to the device
+            cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy dev_data failed!");
+
+            //start time
+            timer().startGpuTimer();
+            //one split pass per bit
+            int totalFalses = 0;
+            for (int i = 0; i < bits_num; i++) {
+                //generate the b and e arrays
+                kernGetBandEArray<<<fullBlocksPerGrid, blockSize>>>(n, i, dev_data, b_array, e_array);
+                //generate the f array
+                StreamCompaction::Efficient::scan(n, f_array, e_array);
+                int *temp = new int[n];
+                cudaMemcpy(temp, f_array, sizeof(int) * n, cudaMemcpyDeviceToHost);
+                /*std::cout << "f_array" << std::endl;
+                for (int i = 0; i < n; i++) {
+                    std::cout << temp[i] << std::endl;
+                }*/
+                int *temp2 = new int[n];
+                cudaMemcpy(temp2, e_array, sizeof(int) * n, cudaMemcpyDeviceToHost);
+                /*std::cout << "e_array" << std::endl;
+                for (int i = 0; i < n; i++) {
+                    std::cout << temp2[i] << std::endl;
+                }*/
+                //totalFalses = f[n-1] + e[n-1]
+                totalFalses = temp[n - 1] + temp2[n - 1];
+                //std::cout << "total: " << totalFalses << std::endl;
+                delete[] temp;
+                delete[] temp2;
+                //get the t array
+                kernGetTArray<<<fullBlocksPerGrid, blockSize>>>(n, f_array, totalFalses, t_array);
+                //get the d array
+                kernGetDArray<<<fullBlocksPerGrid, blockSize>>>(n, b_array, t_array, f_array, d_array);
+                //scatter into the second buffer, then swap the buffers
+                kernRearrange<<<fullBlocksPerGrid, blockSize>>>(n, d_array, dev_data, dev_data2);
+                int *swap = dev_data;
+                dev_data = dev_data2;
+                dev_data2 = swap;
+            }
+            timer().endGpuTimer();
+            //end gpu time
+
+            cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);//get the result
+            checkCUDAError("get odata failed!\n");
+
+            //free
+            cudaFree(dev_data);
+            cudaFree(dev_data2);
+            cudaFree(b_array);
+            cudaFree(e_array);
+            cudaFree(f_array);
+            cudaFree(t_array);
+            cudaFree(d_array);
+        }
+    }
+}
diff --git a/Project2-Stream-Compaction/stream_compaction/radix.h b/Project2-Stream-Compaction/stream_compaction/radix.h
new file mode 100644
index 0000000..c297d18
--- /dev/null
+++ b/Project2-Stream-Compaction/stream_compaction/radix.h
@@ -0,0 +1,12 @@
+#pragma once
+
+#include "common.h"
+#include
+
+namespace StreamCompaction {
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void radix_sort(int n, int bits_num, int *odata, const int *idata);
+    }
+}
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..92d0f05 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -17,12 +17,21 @@ namespace StreamCompaction {
        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
+        //https://thrust.github.io/doc/group__prefixsums_ga7be5451c96d8f649c8c43208fcebb8c3.html
        void scan(int n, int *odata, const int *idata) {
+            thrust::host_vector<int> temp(idata, idata + n);
+            thrust::device_vector<int> dev_in(temp);
+            thrust::device_vector<int> dev_out(n);
+            cudaDeviceSynchronize();
            timer().startGpuTimer();
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
-            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            //thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
            timer().endGpuTimer();
+
+            cudaMemcpy(odata, thrust::raw_pointer_cast(&dev_out[0]), sizeof(int) * n, cudaMemcpyDeviceToHost);//get the result
+            checkCUDAError("get odata failed!");
        }
    }
}
diff --git a/README.md b/README.md
index 3a0b2fe..05f1ff1 100644
--- a/README.md
+++ b/README.md
@@ -3,14 +3,46 @@ CUDA Number Algorithms
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yan Dong
+  - [LinkedIn](https://www.linkedin.com/in/yan-dong-572b1113b/)
+  - [personal website](coffeier.com)
+  - [github](https://github.com/coffeiersama)
+* Tested on: Windows 10, i7-8750 @ 2.22GHz (12CPUs) 16GB, GTX 1060 14202MB (OMEN 15-dc0xxx)
 
-### (TODO: Your README)
+
+
+### Homework2
 
 Link to the readmes of the other two subprojects.
 
-Add anything else you think is relevant up to this point.
-(Remember, this is public, so don't put anything here that you don't want to share with the world.)
+[stream compaction](https://github.com/coffeiersama/Project2-Number-Algorithms/tree/master/Project2-Stream-Compaction)
+
+[character recognition](https://github.com/coffeiersama/Project2-Number-Algorithms/tree/master/Project2-Character-Recognition)
+
+
+
+### Some useful knowledge review
+
+##### shared_memory
+
+on-board memory (global memory, high latency)
+
+on-chip memory (shared memory, low latency)
+
+Shared memory is roughly 20-30x faster than global memory.
+
+Each block gets its own partition of shared memory.
+
+###### Definition:
+
+`extern __shared__ int tile[];` //used when the shared memory size is not known at compile time
+
+[Reference](https://www.cnblogs.com/1024incn/p/4605502.html)
+
+
+
+##### Resolving the unresolved external symbol problem:
+
+The compiler can find the header file, but the linker cannot find the matching library (DLL), so add the library to the linker settings.
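+
+To make the shared-memory definition above concrete, here is a minimal sketch of the dynamic shared-memory pattern (the kernel and launch names are illustrative, not taken from this project's code):
+
+```
+// The size of tile[] is not known at compile time; it is supplied as the
+// third kernel-launch parameter, in bytes.
+__global__ void kernCopyThroughShared(int n, int *odata, const int *idata) {
+    extern __shared__ int tile[];
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index < n) {
+        tile[threadIdx.x] = idata[index];  // global -> shared (slow read, once)
+    }
+    __syncthreads();  // every thread in the block must reach this barrier
+    if (index < n) {
+        odata[index] = tile[threadIdx.x];  // shared -> global (slow write, once)
+    }
+}
+
+// Launch: the third <<< >>> argument is the dynamic shared-memory size in bytes.
+// kernCopyThroughShared<<<fullBlocksPerGrid, blockSize, blockSize * sizeof(int)>>>(n, dev_out, dev_in);
+```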