GitHub - erichpeterson/pgfim: Probabilistic Generalized Frequent Itemset Mining (PGFIM) in Uncertain Databases

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
R_scripts		R_scripts
datasets		datasets
Commandline.cpp		Commandline.cpp
Commandline.h		Commandline.h
Database.cpp		Database.cpp
Database.h		Database.h
DynamicCalc.cpp		DynamicCalc.cpp
DynamicCalc.h		DynamicCalc.h
EnumTree.cpp		EnumTree.cpp
EnumTree.h		EnumTree.h
Global.h		Global.h
IProbCalc.h		IProbCalc.h
LICENSE		LICENSE
Makefile		Makefile
Mine.cpp		Mine.cpp
Mine.h		Mine.h
README		README
TaxVertex.cpp		TaxVertex.cpp
TaxVertex.h		TaxVertex.h
Taxonomy.cpp		Taxonomy.cpp
Taxonomy.h		Taxonomy.h
TreeVertex.cpp		TreeVertex.cpp
TreeVertex.h		TreeVertex.h
Util.cpp		Util.cpp
Util.h		Util.h
main.cpp		main.cpp
run_test_series_eta.pl		run_test_series_eta.pl
run_test_series_tau.pl		run_test_series_tau.pl
tinystr.cpp		tinystr.cpp
tinystr.h		tinystr.h
tinyxml.cpp		tinyxml.cpp
tinyxml.h		tinyxml.h
tinyxmlerror.cpp		tinyxmlerror.cpp
tinyxmlparser.cpp		tinyxmlparser.cpp

Repository files navigation

IMPORTANT:
Two pieces of software must be installed to compile PGFIM:
1. The high-precision static library ARPREC, which one can download and install
from the following website: http://crd-legacy.lbl.gov/~dhbailey/mpdist/
2. The mathematics library Boost, which one can download and install from the
following website: http://www.boost.org/

COMPILING:
To compile on a Linux-based machine, modify the Makefile file, to your environment and
locations of the two software libraries installed above. One can find where the locations
are set in the Makefile, next to the CXX_CFLAGS and CXX_LDFLAGS variables.

Once your environment is set correctly, simply type 'make' at the command-line
in the directory of the source code, and the executable PGFIM should be created.

RUNNING:
./PGFIM -tau <TAU> -inputDB <INPUT_DB_FILENAME> -inputXML <INPUT_XML_FILENAME> [optional_args]

Sample Execution:
./PGFIM -tau 0.9 -minsup 3000 -inputDB input.data -inputXML taxonomy.xml

Command Line Options:
-minsup minimum support threshold: [0.0-Size of DB] default: 1
-tau Tau frequent probability threshold: (0.0-1.0] (Required)
-inputDB Input DB file name (Required)
-inputXML Input XML taxonomy (Required)
-calc Frequentness calculation method: {dynamic} default: dynamic
-output Output file name
-exec Execution stats file name
-print Print found itemsets: {0 | 1} (0 = false, 1 = true) default 0

DATABASE FILE & TAXONOMY FORMAT:
Attributes in the database file are to be space-delimited, and transactions/instances
should be new-line delimited. Currently, only the natural numbers can be used as
attribute names / items. Following each attribute name is a colon and then the
probability of that attribute occurring.

IMPORTANT: The last item and probability of each transaction must be followed by a space.

Example database with 3 transactions:
1:0.9 2:0.23 3:0.6
2:0.2 4:0.8
1:0.4 2:0.54

An XML file is used to represent the taxonomy to be used. See a sample taxonomy in the
datasets directory, as well as, example databases.

CITATION:
Erich A. Peterson and Peiyi Tang. Mining Probabilistic Generalized Frequent Itemsets in Uncertain Databases. In Proceedings of the 51tst ACM Southeast Conference (ACMSE), Savannah, GA, USA, April 2013.