Feature/wikidata importer #178

Draft: wants to merge 13 commits into `main`
11 changes: 5 additions & 6 deletions README.md
@@ -23,25 +23,24 @@
`apt-get install python3-yaml python3-requests python3-click python3-distro python3-psutil python3-pexpect python3-pyftpdlib python3-statsd python3-selenium python3-pip gdb`

the `python3-semver` package on Debian is too old; use the pip version instead:
`pip3 install semver beautifultable allure_python_commons certifi tabulate`
`pip3 install semver beautifultable allure_python_commons certifi csv tabulate`

The Ubuntu 16.04 pip3 system package is broken. Fix it like this:
`dpkg -r python3-pip python3-pexpect`
`python3.8 -m easy_install pip`
`pip install distro semver pexpect psutil beautifultable allure_python_commons certifi`

`pip install distro semver pexpect psutil beautifultable tabulate allure_python_commons certifi csv`
- **centos**:
`yum update ; yum install python3 python3-pyyaml python36-PyYAML python3-requests python3-click gcc platform-python-devel python3-distro python3-devel python36-distro python36-click python36-pexpect python3-pexpect python3-pyftpdlib; pip3 install psutil semver beautifultable`
`yum update ; yum install python3 python3-pyyaml python36-PyYAML python3-requests python3-click gcc platform-python-devel python3-distro python3-devel python36-distro python36-click python36-pexpect python3-pexpect python3-pyftpdlib; pip3 install psutil semver beautifultable tabulate allure_python_commons certifi csv`
`sudo yum install gdb`
- **plain pip**:
`pip3 install psutil pyyaml pexpect requests click semver ftplib selenium beautifultable tabulate allure_python_commons certifi`
`pip3 install psutil pyyaml pexpect requests click semver ftplib selenium beautifultable tabulate allure_python_commons certifi csv`
or:
`pip install -r requirements.txt`

## Mac OS
:
`brew install gnu-tar`
`pip3 install click psutil requests pyyaml semver pexpect selenium beautifultable tabulate allure_python_commons certifi`
`pip3 install click psutil requests pyyaml semver pexpect selenium beautifultable markdown allure_python_commons tabulate allure_python_commons certifi csv`
`brew install gdb`
if `python --version` is below 3.9 you also have to download ftplib:
`pip3 install click ftplib`
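
Review note: the changed install lines add `csv` to the pip package lists. The `csv` module that `imp.py` imports is part of the Python standard library, so the sketch below only checks that it is importable without installing anything. The variable names are illustrative and are not taken from the PR.

```python
import csv
import sys

# sys.stdlib_module_names exists on Python 3.10+; older versions yield None here.
stdlib_names = getattr(sys, "stdlib_module_names", None)
csv_is_stdlib = None if stdlib_names is None else "csv" in stdlib_names

# csv.reader is the entry point the importer in this PR relies on.
print("csv importable:", hasattr(csv, "reader"), "listed as stdlib:", csv_is_stdlib)
```

If this runs on the target interpreters, the `csv` entries in the pip lists may be unnecessary; that is for the author to confirm.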
2 changes: 1 addition & 1 deletion containers/docker_deb/Dockerfile
@@ -53,7 +53,7 @@ RUN if [ -z "$CHROME_DRIVER_VERSION" ]; \
&& sudo ln -fs /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION /usr/bin/chromedriver


RUN pip3 install semver selenium beautifultable allure_python_commons mss tabulate
RUN pip3 install semver selenium beautifultable allure_python_commons mss tabulate csv

run mkdir -p /home/release-test-automation \
/home/package_cache \
4 changes: 2 additions & 2 deletions containers/docker_rpm/Dockerfile
@@ -17,8 +17,8 @@ run mkdir -p /home/release-test-automation \
/home/entrypoint
RUN mkdir -p /home/entrypoint

RUN yum -y update; yum install -y python3 python3-pyyaml python36-PyYAML python3-requests python3-click gcc platform-python-devel python3-distro python3-devel python36-distro python36-click python36-pexpect python3-pexpect python3-pyftpdlib initscripts file gdb chromedriver chromium python3-markdown;
RUN pip3 install selenium psutil semver click requests pyyaml distro pexpect beautifultable allure_python_commons tabulate certifi mss
RUN yum -y update; yum install -y python3 python3-pyyaml python36-PyYAML python3-requests python3-click gcc platform-python-devel python3-distro python3-devel python36-distro python36-click python36-pexpect python3-pexpect python3-pyftpdlib initscripts file gdb chromedriver chromium;
RUN pip3 install selenium psutil semver click requests pyyaml distro pexpect beautifultable allure_python_commons tabulate certifi mss csv

RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in ; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
rm -rf /lib/systemd/system/multi-user.target.wants/;\
2 changes: 1 addition & 1 deletion containers/docker_tar/Dockerfile
@@ -48,7 +48,7 @@ RUN if [ -z "$CHROME_DRIVER_VERSION" ]; \
#VOLUME ["/sys/fs/cgroup"]
# STOPSIGNAL SIGRTMIN+3

RUN pip3 install semver beautifultable allure_python_commons mss tabulate
RUN pip3 install semver beautifultable allure_python_commons mss tabulate csv
RUN mkdir -p /home/entrypoint /home/release-test-automation /home/package_cache /home/versions /home/test_dir
# ADD tarball_nightly_test.py /home/entrypoint/tarball_nightly_test.py
# ENTRYPOINT ["/home/entrypoint/tarball_nightly_test.py"]
2 changes: 1 addition & 1 deletion containers/this_version.txt
@@ -1 +1 @@
1.2
1.4
37 changes: 27 additions & 10 deletions release_tester/arangodb/async_client.py
@@ -47,6 +47,8 @@ def convert_result(result_array):
result += "\n" + one_line[0].decode("utf-8").rstrip()
return result

def custom_writer(ArangoCLIprogressiveTimeoutExecutorInstance, writer):
writer(ArangoCLIprogressiveTimeoutExecutorInstance)

class CliExecutionException(Exception):
"""transport CLI error texts"""
@@ -70,6 +72,7 @@ def __init__(self, config, connect_instance):
"""launcher class for cli tools"""
self.connect_instance = connect_instance
self.cfg = config
self.process = None

def run_arango_tool_monitored(
self,
@@ -79,6 +82,7 @@ def run_arango_tool_monitored(
result_line,
verbose,
expect_to_fail=False,
writer=None
):
"""
runs a script in background tracing with
@@ -93,18 +97,21 @@ def run_arango_tool_monitored(
"--server.username", str(self.cfg.username),
"--server.password", str(self.connect_instance.get_passvoid())
] + more_args
return self.run_monitored(executeable, run_cmd, timeout, result_line, verbose, expect_to_fail)
return self.run_monitored(executeable, run_cmd, timeout, result_line, verbose, expect_to_fail, writer=writer)
# fmt: on

def run_monitored(self, executeable, args, timeout, result_line, verbose, expect_to_fail=False):
def run_monitored(self, executeable, args, timeout, result_line, verbose, expect_to_fail=False, writer=None):
"""
run a script in the background, tracking it with a dynamic timeout that resets whenever it produces output (i.e. it is still alive)
"""

write_pipe = None
if writer is not None:
write_pipe = PIPE
run_cmd = [executeable] + args
lh.log_cmd(run_cmd, verbose)
process = Popen(
self.process = Popen(
run_cmd,
stdin=write_pipe,
stdout=PIPE,
stderr=PIPE,
close_fds=ON_POSIX,
@@ -114,23 +121,31 @@ def run_monitored(self, executeable, args, timeout, result_line, verbose, expect
thread1 = Thread(
name="readIO",
target=enqueue_stdout,
args=(process.stdout, queue, self.connect_instance),
args=(self.process.stdout, queue, self.connect_instance),
)
thread2 = Thread(
name="readErrIO",
target=enqueue_stderr,
args=(process.stderr, queue, self.connect_instance),
args=(self.process.stderr, queue, self.connect_instance),
)
thread1.start()
thread2.start()
thread3 = None
if writer is not None:
thread3 = Thread(
name="WriteIO",
target=custom_writer,
args=(self, writer),
)
thread3.start()

try:
print(
"me PID:%d launched PID:%d with LWPID:%d and LWPID:%d"
% (os.getpid(), process.pid, thread1.native_id, thread2.native_id)
% (os.getpid(), self.process.pid, thread1.native_id, thread2.native_id)
)
except AttributeError:
print("me PID:%d launched PID:%d with LWPID:N/A and LWPID:N/A" % (os.getpid(), process.pid))
print("me PID:%d launched PID:%d with LWPID:N/A and LWPID:N/A" % (os.getpid(), self.process.pid))

# ... do other things here
# out = logfile.open('wb')
@@ -169,10 +184,12 @@ def run_monitored(self, executeable, args, timeout, result_line, verbose, expect
timeout_str = "TIMEOUT OCCURED!"
print(timeout_str)
timeout_str += "\n"
process.kill()
rc_exit = process.wait()
self.process.kill()
rc_exit = self.process.wait()
thread1.join()
thread2.join()
if writer:
thread3.join()

if have_timeout or rc_exit != 0:
res = (False, timeout_str + convert_result(result), rc_exit, line_filter)
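
Review note: the change above opens `stdin` as a pipe only when a `writer` is given, and runs that writer on a third thread next to the two reader threads. A minimal sketch of the same pattern follows, assuming a throwaway child process that upper-cases its input. `feed_lines` and `run_with_writer` are hypothetical names and do not exist in this repository.

```python
import sys
from subprocess import PIPE, Popen
from threading import Thread


def feed_lines(process, lines):
    """Write each line to the child's stdin, then close it so the child sees EOF."""
    for line in lines:
        process.stdin.write((line + "\n").encode())
    process.stdin.close()


def run_with_writer(lines):
    """Start a child that upper-cases stdin and feed it from a separate thread."""
    child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    with Popen([sys.executable, "-c", child_code], stdin=PIPE, stdout=PIPE, stderr=PIPE) as process:
        writer_thread = Thread(name="WriteIO", target=feed_lines, args=(process, lines))
        writer_thread.start()
        # Reading here while the writer thread runs keeps either pipe from filling up.
        output = process.stdout.read().decode()
        process.stderr.read()
        return_code = process.wait()
        writer_thread.join()
    return return_code, output


if __name__ == "__main__":
    print(run_with_writer(["hello", "world"]))
```

Closing `stdin` in the writer is what lets the child finish; the `wikidata_writer` added in `imp.py` does the same once its loop ends.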
84 changes: 81 additions & 3 deletions release_tester/arangodb/imp.py
@@ -2,7 +2,12 @@
""" Run a javascript command by spawning an arangosh
to the configured connection """

import json
import csv
import ctypes

from arangodb.async_client import ArangoCLIprogressiveTimeoutExecutor, dummy_line_result
from tools.asciiprint import print_progress as progress


def get_type_args(filename):
@@ -13,15 +18,51 @@ def get_type_args(filename):
return ["--type=json"]
if str(filename).endswith("csv"):
return ["--type=csv"]
if filename == "-":
return ["--type=jsonl"]
raise NotImplementedError("no filename type encoding implemented for " + filename)

month_decode = {
    "JAN": "01",
    "FEB": "02",
    "MAR": "03",
    "APR": "04",
    "MAY": "05",
    "JUN": "06",
    "JUL": "07",
    "AUG": "08",
    "SEP": "09",
    "OCT": "10",
    "NOV": "11",
    "DEC": "12"
}

def decode_date(date):
    """convert date to something more arango'ish"""
    if len(date) == 24:
        month = date[3:6]
        day = date[0:2]
        year = date[7:11]
        time = date[12:24]
        year += "-"
        year += month_decode.get(month, "01")
        year += "-"
        year += day
        year += "T"
        year += time
        return year
    return date
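
Review note: judging by the slice offsets, `decode_date` appears to expect 24-character input shaped like `DD-MON-YYYY HH:MM:SS.mmm`. That format is an inference and is not stated anywhere in the PR. The sketch below is a compact restatement under that assumption; `decode_date_sketch` and `MONTHS` are hypothetical names.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def decode_date_sketch(date):
    """Turn 'DD-MON-YYYY HH:MM:SS.mmm' into 'YYYY-MM-DDTHH:MM:SS.mmm'."""
    if len(date) != 24:
        # Anything that is not exactly 24 characters is passed through unchanged.
        return date
    day, month, year, time = date[0:2], date[3:6], date[7:11], date[12:24]
    # Unknown month abbreviations fall back to January, mirroring the "01" default.
    month_number = MONTHS.index(month) + 1 if month in MONTHS else 1
    return f"{year}-{month_number:02d}-{day}T{time}"


print(decode_date_sketch("17-MAR-2021 08:15:30.250"))  # 2021-03-17T08:15:30.250
```

The lookup is case-sensitive, so an abbreviation such as `Mar` would silently become month `01`. It may be worth checking how the source data spells months.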

class ArangoImportExecutor(ArangoCLIprogressiveTimeoutExecutor):
"""configuration"""

# pylint: disable=W0102
def __init__(self, config, connect_instance):
super().__init__(config, connect_instance)
self.wikidata_reader = None
self.wikidata_nlines = 0

def run_import_monitored(self, args, timeout, verbose=True, expect_to_fail=False):
def run_import_monitored(self, args, timeout, verbose=True, expect_to_fail=False, writer=None):
# pylint: disable=R0913 disable=R0902 disable=R0915 disable=R0912 disable=R0914
"""
runs an import in background tracing with
@@ -40,9 +81,10 @@ def run_import_monitored(self, args, timeout, verbose=True, expect_to_fail=False
dummy_line_result,
verbose,
expect_to_fail,
writer=writer
)

def import_collection(self, collection_name, filename, more_args=[]):
def import_collection(self, collection_name, filename, more_args=[], writer=None):
"""import into any collection"""
# fmt: off
args = [
@@ -51,7 +93,7 @@ def import_collection(self, collection_name, filename, more_args=[]):
] + get_type_args(filename) + more_args
# fmt: on

ret = self.run_import_monitored(args, timeout=20, verbose=self.cfg.verbose)
ret = self.run_import_monitored(args, timeout=20, verbose=self.cfg.verbose, writer=writer)
return ret

def import_smart_edge_collection(self, collection_name, filename, edge_relations, more_args=[]):
@@ -67,3 +109,39 @@ def import_smart_edge_collection(self, collection_name, filename, edge_relations

ret = self.import_collection(collection_name, filename, more_args=args)
return ret

    def wikidata_writer(self):
        """pipe wikidata file into importer while translating it"""
        count = 0
        for row in self.wikidata_reader:
            count += 1
            if count > self.wikidata_nlines:
                print("imported enough, aborting.")
                break
            if count > 1: # headline, we don't care...
                line = json.dumps({
                    'title': row[0],
                    'body': row[2],
                    'count': count,
                    'created': decode_date(row[1])}) + "\n"
                # print(line)
                progress("I")
                self.process.stdin.write(line.encode())
        self.process.stdin.close()

    def import_wikidata(self, collection_name, nlines, filename, more_args=[]):
        """import by write piping"""
        filedes = filename.open("r", encoding='utf-8', errors='replace')
        self.wikidata_reader = csv.reader(filedes, delimiter='\t')
        self.wikidata_nlines = nlines
        # Override csv default 128k field size
        csv.field_size_limit(int(ctypes.c_ulong(-1).value // 2))

        # args = get_type_args('foo.json') + more_args
        args = ['--create-collection', 'true' ] + more_args
        ret = self.import_collection(
            collection_name,
            filename="-",
            more_args=args,
            writer=ArangoImportExecutor.wikidata_writer)
        return ret
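
Review note: the importer above reads tab-separated rows, skips the header, and emits one JSON document per line. A minimal standalone sketch of that conversion follows, assuming three columns in the order title, created, body, as the row indices in `wikidata_writer` suggest. `raise_field_size_limit` and `rows_to_jsonl` are hypothetical names. The retry loop is a widely used alternative to the `ctypes` expression and is not what the PR does.

```python
import csv
import io
import json
import sys


def raise_field_size_limit():
    """Raise the csv field limit as far as the platform's C long allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Where a C long is 32 bits wide, sys.maxsize is too large: halve and retry.
            limit //= 2


def rows_to_jsonl(tsv_file, max_rows):
    """Yield one JSON line per data row, skipping the header row."""
    reader = csv.reader(tsv_file, delimiter="\t")
    next(reader, None)  # drop the header row
    for count, row in enumerate(reader, start=1):
        if count > max_rows:
            break
        yield json.dumps({"title": row[0], "created": row[1], "body": row[2], "count": count}) + "\n"


if __name__ == "__main__":
    raise_field_size_limit()
    sample = io.StringIO("title\tcreated\tbody\nA\t01-JAN-2020 00:00:00.000\tfoo\n")
    for line in rows_to_jsonl(sample, max_rows=10):
        sys.stdout.write(line)
```

Unlike `wikidata_writer`, this sketch counts data rows only and leaves the `created` value untouched. In the PR the header row counts towards `nlines`, so at most `nlines - 1` documents are written.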
2 changes: 2 additions & 0 deletions release_tester/arangodb/starter/deployments/activefailover.py
@@ -211,6 +211,8 @@ def test_setup_impl(self):
if self.selenium:
self.set_selenium_instances()
self.selenium.test_setup()
self.wikidata_import_impl()
self.execute_views_tests_impl()

def wait_for_restore_impl(self, backup_starter):
backup_starter.wait_for_restore()
2 changes: 2 additions & 0 deletions release_tester/arangodb/starter/deployments/cluster.py
@@ -132,6 +132,8 @@ def finish_setup_impl(self):
def test_setup_impl(self):
if self.selenium:
self.selenium.test_setup()
self.wikidata_import_impl()
self.execute_views_tests_impl()

def wait_for_restore_impl(self, backup_starter):
for starter in self.starter_instances:
3 changes: 3 additions & 0 deletions release_tester/arangodb/starter/deployments/dc2dc.py
@@ -430,6 +430,9 @@ def test_setup_impl(self):
print(res[1])
raise Exception("replication fuzzing test failed")
self._get_in_sync(12)
self.wikidata_import_impl()
self.execute_views_tests_impl()
self._get_in_sync(12)

def wait_for_restore_impl(self, backup_starter):
for dbserver in self.cluster1["instance"].get_dbservers():
10 changes: 6 additions & 4 deletions release_tester/arangodb/starter/deployments/leaderfollower.py
@@ -1,20 +1,21 @@
#!/usr/bin/env python
""" launch and manage an arango deployment using the starter"""
import time
import os
import logging
from pathlib import Path

from tools.interact import prompt_user
from tools.killall import get_all_processes
from arangodb.async_client import dummy_line_result
from arangodb.starter.manager import StarterManager
from arangodb.instance import InstanceType
from arangodb.starter.deployments.runner import Runner, RunnerProperties
import tools.loghelper as lh
from tools.interact import prompt_user
from tools.killall import get_all_processes
from tools.asciiprint import print_progress as progress

from reporting.reporting_utils import step


class LeaderFollower(Runner):
"""this runs a leader / Follower setup with synchronisation"""

@@ -224,7 +225,8 @@ def test_setup_impl(self):
self.make_data()
if self.selenium:
self.selenium.test_setup()

self.wikidata_import_impl()
self.execute_views_tests_impl()
logging.info("Leader follower setup successfully finished!")

@step