
dev-d/SparkAnonymizationToolkit

 
 


Spark Anonymization Toolkit

WORK IN PROGRESS

# To Run

Execute the Main object with the following arguments: spark.master, no.of.cores, worker.memory, /input/path/folder/ [hdfs/tachyon/localdfs], input.filename

Note: the input.filename must be accompanied by an input.filename.taxonomy file describing the taxonomy of the categorical data. An example data set is under the data folder.

Example: spark://master:1234 1 256m /home/antorweep/Documents/data/ data.dat
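As an illustration of the argument order above, here is a minimal sketch of how the five positional arguments could be validated before a run. `RunConfig` and its field names are hypothetical and not taken from the toolkit, and the taxonomy path assumes the reading that `.taxonomy` is appended to the full input file name.

```scala
// Hypothetical holder for the five positional arguments listed above.
// None of these names come from the toolkit itself.
final case class RunConfig(
    sparkMaster: String,  // spark.master, e.g. spark://master:1234
    cores: Int,           // no.of.cores
    workerMemory: String, // worker.memory, e.g. 256m
    inputFolder: String,  // /input/path/folder/
    inputFile: String     // input.filename, e.g. data.dat
) {
  // Make sure the folder ends with exactly one separator before joining.
  private val folder: String =
    if (inputFolder.endsWith("/")) inputFolder else inputFolder + "/"

  def dataPath: String = folder + inputFile

  // Assumption: the taxonomy file is "<input.filename>.taxonomy" in the same folder.
  def taxonomyPath: String = dataPath + ".taxonomy"
}

object RunConfig {
  // Returns Left(message) on a malformed argument list instead of throwing.
  def parse(args: Array[String]): Either[String, RunConfig] = args match {
    case Array(master, cores, memory, folder, file) =>
      scala.util.Try(cores.toInt).toOption match {
        case Some(n) if n > 0 => Right(RunConfig(master, n, memory, folder, file))
        case _                => Left(s"no.of.cores must be a positive integer, got: $cores")
      }
    case _ =>
      Left("expected 5 arguments: spark.master no.of.cores worker.memory /input/path/folder/ input.filename")
  }
}
```

With the example arguments above, `RunConfig.parse(args).map(_.taxonomyPath)` would point at `data.dat.taxonomy` inside the data folder, which makes it easy to check that the file exists before submitting the job.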

P.S. The Bucketization and Dichotomization phases create their own temp files under /input/path/folder/out/buckets and /input/path/folder/out/ec respectively. These folders (../buckets and ../ecs) need to be deleted prior to a 2nd run.
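The cleanup step can be scripted. The following is a minimal sketch that assumes the input folder is on the local filesystem (HDFS or Tachyon would need their own client APIs). `CleanTempOutput` is a hypothetical helper, not part of the toolkit, and the folder names are passed in by the caller because the note above refers to both `ec` and `ecs`.

```scala
import java.io.File

// Hypothetical helper, local filesystem only.
object CleanTempOutput {
  // Deletes a file or a directory tree; returns false if nothing was deleted.
  def deleteRecursively(f: File): Boolean = {
    // listFiles() returns null for plain files and missing paths.
    val children = Option(f.listFiles()).getOrElse(Array.empty[File])
    children.foreach(c => deleteRecursively(c))
    f.delete()
  }

  // Removes <inputFolder>/out/<name> for every given folder name.
  def clean(inputFolder: String, names: Seq[String]): Unit =
    names.foreach { name =>
      deleteRecursively(new File(new File(inputFolder, "out"), name))
    }
}
```

For example, `CleanTempOutput.clean("/input/path/folder/", Seq("buckets", "ec", "ecs"))` could be called before each run; names that do not exist are simply skipped.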

# Solution Description

1. Incognito
   1.1 Bucketization: OK
   1.2 Dichotomization: OK
   1.3 Redistribution: OK
   1.4 Recording: OK
2. beta-likeness
   2.1 Bucketization: OK
   2.2 - 2.4: same as Incognito: OK
3. t-closeness
   3.1 Bucketization: OK
   3.2 Extension of Dichotomization from Incognito: OK
   3.3 - 3.4: same as Incognito: OK

TODO: Code documentation
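For readers unfamiliar with the two privacy models named above, here is a minimal sketch of the per-equivalence-class checks as they are commonly defined: basic beta-likeness bounds the relative gain (q - p) / p of any sensitive value, and t-closeness bounds the distance between the class distribution and the overall one. The sketch uses the equal-ground-distance form of the Earth Mover's Distance (half the L1 distance), which is an assumption; the toolkit's own distance and all names below (`PrivacyChecks`, `Dist`, and so on) are not taken from its source.

```scala
// Hypothetical, self-contained checks; not the toolkit's implementation.
object PrivacyChecks {
  // Sensitive value -> relative frequency.
  type Dist = Map[String, Double]

  def distribution(values: Seq[String]): Dist =
    values.groupBy(identity).map { case (v, vs) => v -> vs.size.toDouble / values.size }

  // Largest relative gain (q - p) / p over values that are more frequent
  // in the equivalence class (q) than in the whole table (p).
  def maxRelativeGain(overall: Dist, ec: Dist): Double =
    ec.iterator.map { case (v, q) =>
      val p = overall.getOrElse(v, 0.0)
      if (q <= p) 0.0
      else if (p > 0.0) (q - p) / p
      else Double.PositiveInfinity // value unseen in the overall table
    }.foldLeft(0.0)((a, b) => math.max(a, b))

  def satisfiesBetaLikeness(overall: Dist, ec: Dist, beta: Double): Boolean =
    maxRelativeGain(overall, ec) <= beta

  // EMD under equal ground distance: half the sum of absolute differences.
  // toSeq keeps equal differences from collapsing inside a Set.
  def equalDistanceEmd(overall: Dist, ec: Dist): Double =
    (overall.keySet ++ ec.keySet).toSeq
      .map(v => math.abs(overall.getOrElse(v, 0.0) - ec.getOrElse(v, 0.0)))
      .sum / 2.0

  def satisfiesTCloseness(overall: Dist, ec: Dist, t: Double): Boolean =
    equalDistanceEmd(overall, ec) <= t
}
```

In a Spark job, a check like this would typically run once per equivalence class after the overall distribution has been computed and broadcast; that wiring is left out here to keep the sketch independent of any Spark version.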

# The following experiments evaluate the solution

1. Information Loss: Imp. OK
2. Similarity & Skewness Attacks: Imp. OK
3. Accuracy on anonymized data
   3.1 Association Rule Mining
   3.2 Classification
4. Performance evaluation: 15-20 machines for simulated data with respect to the number of allocated cores

About

Distributed anonymization algorithms for Apache Spark


Languages

  • Scala 89.0%
  • Shell 6.9%
  • R 3.9%
  • Java 0.2%