Home
The LMS data harvester is a program used to extract course data from a Learning Management System (LMS) into a database suitable for research. As the data is extracted from the source (LMS) database, it is scrubbed to remove user-identifying information. Specifically, references to the user in the source dataset are replaced with references to an enrolment in the destination dataset, and IP addresses, which could potentially be used to identify a user, are replaced with the name of the network owner. Replacing the IP address with the name of the network owner (as provided by a whois query to ARIN) should make the user anonymous, while still providing enough data to be useful in analysis. Enrolments are inserted into the destination database in random order, as a measure to further obscure the identity of the user. The identity of the user is stored in a separate schema in the destination database; this user-identifying information is intended to be purged once the course data is determined to be complete. While the program is structured to be extensible and could, in principle, support extracting data from multiple Learning Management Systems, Moodle is currently the only supported LMS.
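The IP-address scrubbing described above amounts to a whois lookup followed by a substitution. The sketch below is not taken from the harvester; it simply issues a plain RFC 3912 query to ARIN's whois server and picks out the OrgName field, and the class and method names are hypothetical.

```java
// Minimal sketch of the IP-to-network-owner lookup idea (hypothetical names);
// the harvester's actual lookup and parsing code may differ.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class WhoisOrgLookup {

    /** Returns the OrgName reported by ARIN for the given IP address, or null. */
    public static String orgNameFor(String ipAddress) throws Exception {
        try (Socket socket = new Socket("whois.arin.net", 43);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), "US-ASCII"))) {

            out.write(ipAddress + "\r\n");   // whois queries are terminated by CRLF
            out.flush();

            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("OrgName:")) {
                    return line.substring("OrgName:".length()).trim();
                }
            }
        }
        return null;   // no organization name found in the response
    }
}
```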
To be extensible, the code base is structured as a library with a small driver program. The library provides a complete representation of the domain model, along with facilities for reading from and writing to the data-stores; it could be used to develop a more sophisticated interface for performing the data extraction, or for operating upon the research database. The driver program is responsible for processing the main configuration file, loading the Moodle adaptor, and telling the library which course to process.
Processing the data for a course is a two-stage process. During the first stage, the data for the course is extracted from the source database and stored in memory. Once the load and extraction have been completed, the data in memory is synchronized with the destination database and written out. The code base is designed with the philosophy that it is better to fail completely than to possibly write out incorrect data. Using a two-stage process allows the enrolments to be written out in random order, and allows the process to fail more gracefully, as errors are more likely to be found while extracting the course data from Moodle.
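As a rough illustration of this flow, the sketch below loads everything first and only then touches the destination database. The interfaces and names here are hypothetical stand-ins, not the library's actual API.

```java
import java.util.Collections;
import java.util.List;

// All names here are placeholders; this only illustrates the extract-then-write
// ordering described above.
public class TwoStageHarvest {

    interface CourseSource {                        // e.g. the Moodle adaptor
        List<String> loadEnrolments(long courseId);
    }

    interface CourseSink {                          // e.g. the research database
        void writeEnrolments(long courseId, List<String> enrolments);
    }

    public static void harvest(CourseSource source, CourseSink sink, long courseId) {
        // Stage 1: extract the course data and hold it in memory.  Any failure
        // here aborts the run before the destination database is touched.
        List<String> enrolments = source.loadEnrolments(courseId);

        // Enrolments are written in random order to help obscure user identity.
        Collections.shuffle(enrolments);

        // Stage 2: synchronize the in-memory data with the destination database.
        sink.writeEnrolments(courseId, enrolments);
    }
}
```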
# Extracting course data from Moodle

To successfully extract the data for a given course from the Moodle database, the following actions need to be performed:
- Add any missing activity implementations
- Configure the source database
- Configure the destination database
- Configure the harvester's inputs
- Execute the harvester
Some of these actions, such as configuring the source and destination databases, will only need to be performed once, while others will need to be performed for each course. The sub-sections below discuss the details for each of these actions.
## Setting up the Activities

The harvester needs an Activity implementation for each module that has been added to the course being processed. Modules that were added to the course and later removed also require an Activity implementation. If a module without an associated Activity implementation is found in the course, the program immediately terminates with an IllegalStateException. The Moodle log also contains "stealth" activities, which have a module id of 0 and no specific Activity implementations; these "activities" are handled automatically by the GenericActivity implementation. For the list of currently supported activities, see the Supported Activities page. See Configuring Activities for instructions on how to add support for additional activities.
## Setting up the source database

Course data is extracted from the source database. It is recommended that this database be a copy of the actual Moodle database, with the database user used by the harvester having read-only access. The database is expected to be a PostgreSQL database; however, it should be possible to use a different database management system for the source database. See the PostgreSQL section of the dependencies for a discussion of how to use a different database management system. A sample configuration profile for the source database is contained in conf/InputProfile.xml. This profile should be usable once the database connection parameters are filled in. See Configuring Profiles for the details on the file format.
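If the source database is a PostgreSQL copy of the Moodle database, read-only access for the harvester's database user can be granted along these lines. The role, password, and database names below are placeholders, and the grants assume the tables live in the default public schema.

```sql
-- Hypothetical names; substitute your own role, password, and database.
CREATE ROLE harvester_ro LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE moodle_copy TO harvester_ro;
GRANT USAGE ON SCHEMA public TO harvester_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO harvester_ro;
```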
## Setting up the destination database

Once a course has been processed, its data is written to the destination database. As such, the harvester needs read/write access to this database. The destination database must be pre-initialized with both the user and course schemas; both schema files can be found under "src/main/sql". Like the source database, the destination database is expected to be PostgreSQL. See the PostgreSQL section of the dependencies for a discussion of how to use a different database management system. A sample configuration profile for the destination database is contained in conf/OutputProfile.xml. This profile should be usable once the database connection parameters are filled in. See Configuring Profiles for the details on the file format.
## Configuring the inputs

Once all of the activities have been added and the source and destination databases are configured, the final step before executing the harvester is to write the harvester configuration. The harvester configuration specifies where the profile configurations for the input, output, and scratch data-stores can be found, along with the course to process and the location of the associated registration data. A sample harvester configuration can be found in "conf/Harvester.xml". See Configuring Harvesters for the details on the harvester configuration and registration data file formats.
## Running the program

To perform the extraction of the data associated with a course, execute the following command:
    java -jar target/lib/edm.jar /path/to/harvester.xml
It will take a few minutes for the harvester to extract the data for the specified course from the source database and write it to the destination database. During this time the harvester writes log messages to the console as it proceeds through each stage of the operation. The total running time depends on the amount of data in the course being processed and on the performance of the database. An extraction of the data from a course with about a million records took approximately six and a half minutes; this extraction was executed on a 3.3 GHz Core i5 with 8 GB of RAM and an SSD, with the source and destination databases hosted on the same machine. Placing the databases on a different machine could change the running time, as the harvester is I/O bound for many of its operations.
Due to the complexity and interconnected nature of the dataset, it is difficult to remove the data for a course from the research database. For this reason, it is recommended that a new course be extracted from the source database twice: during the first extraction the data should be written to an empty test database, and during the second run it should be written to the production database. Writing the course data to a test database first allows it to be inspected and verified before it is committed to the production database.
# Logging

SLF4J (with Log4j) is used throughout the code base to provide logging functionality. By default, log messages of levels ERROR, WARN, and INFO are logged to both the console and the log file, while messages of levels DEBUG and TRACE are only sent to the log file. The log file is named "app.log" and is found in the "logs" directory. Logging is configured via the log4j configuration file, which is found in the program's resources. Currently, all DEBUG and TRACE level messages are suppressed. If necessary, they may be selectively or globally enabled; however, globally enabling trace level logging will probably produce hundreds of gigabytes of log data.
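Assuming the bundled configuration is a log4j.properties file (the project may instead ship a log4j.xml), DEBUG output can be enabled for a single logger along these lines; the package name shown is only a placeholder.

```properties
# Hypothetical package name -- substitute the logger you are interested in.
log4j.logger.org.example.moodle=DEBUG
```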