Log and Visualise Data (Lavd)

Tracking the progress of Machine Learning (ML) experiments is essential for deciding which model works well with the available data. A metric, for example Intersection over Union (IoU) for object detection, gives a good idea of the quality of the model, but it is also important to look at particular examples to get a better intuition. This and more can be logged and visualised interactively, including:

  • Scalars
  • Images
  • Text
  • Log files
  • Markdown documents
  • CLI commands (how to run the experiment with the same parameters)

...and more to come.

Getting Started

Installing with pip:

pip install --user --upgrade lavd

Once installed, the server can be started by running lavd with the directory containing the (logged) data:

lavd path/to/logs

Or, if you prefer to run it as a module instead of using the executable (both do exactly the same thing):

python -m lavd.server path/to/logs

Logging Data

Lavd includes a logger that can be used to easily log all the desired data.

import lavd

# Logs everything to log/some-experiment-name/
# If no name is given, the current date and time are used as the name
logger = lavd.Logger("some-experiment-name")

Scalars

Scalars are plotted and, unlike the other categories, they have to be assigned to a step.

logger.log_scalar(0.8, "accuracy", step=1)
logger.log_scalar(0.6, "accuracy", step=2)
logger.log_scalar(0.3, "accuracy", step=3)

logger.log_scalar(0.1, "easy/accuracy", step=7)
logger.log_scalar(0.05, "easy/accuracy", step=14)

Images

Images can be logged with optional bounding boxes for interactive visualisation.

# Saves image to: log/some-experiment-name/0001/bird.png
logger.log_image(image, "bird", step=1)
# Saves image to: log/some-experiment-name/0001/other/bird.png
logger.log_image(image, "other/bird", step=1)
# No step saves it to: log/some-experiment-name/other/bird.png
logger.log_image(image, "other/bird")

# With bounding boxes, each box is given by 2 points: top-left and bottom-right.
boxes = [
    {
        "xStart": 100,
        "yStart": 100,
        "xEnd": 150,
        "yEnd": 200,
        "className": "bird", # Optional
        "probability": 0.4,  # Optional
    },
    # Another bounding box, without class or probability
    {"xStart": 200, "yStart": 22, "xEnd": 233, "yEnd": 80,},
]
# List of available classes
classes = ["bird", "orange", "background"]
# Threshold to count boxes as valid and therefore visible
threshold = 0.2
logger.log_image(
    image,
    "birds-with-bounding-boxes",
    step=3,
    boxes=boxes,
    classes=classes,
    threshold=threshold,
)
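
Detection models often return boxes as a position plus width and height rather than as two corner points. The following is a minimal sketch of converting such results into the dictionary format shown above. The helper `to_box_dicts` and the input tuple layout are hypothetical and not part of Lavd; only the dictionary keys are taken from the example above.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

# Hypothetical input layout: (x, y, width, height, class name, probability)
Detection = Tuple[float, float, float, float, Optional[str], Optional[float]]


def to_box_dicts(detections: Sequence[Detection]) -> List[Dict[str, Union[float, str]]]:
    """Convert (x, y, w, h, class, prob) tuples to corner-based box dicts."""
    boxes = []
    for x, y, width, height, class_name, probability in detections:
        box: Dict[str, Union[float, str]] = {
            "xStart": x,
            "yStart": y,
            "xEnd": x + width,
            "yEnd": y + height,
        }
        # className and probability are optional, so only add them if present
        if class_name is not None:
            box["className"] = class_name
        if probability is not None:
            box["probability"] = probability
        boxes.append(box)
    return boxes


boxes = to_box_dicts([(100, 100, 50, 100, "bird", 0.4), (200, 22, 33, 58, None, None)])
print(boxes)
```

The resulting list can then be passed as the boxes argument of log_image.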

Text

Text can be logged together with an optional expected text, in which case a diff between the actual and the expected text is shown; otherwise only the text itself is shown.

logger.log_text("The quick brown fox...", "famous-sentence", step=1)
# With an expected text
logger.log_text("hello world", "with-diff", step=2, expected="Hallo Welt")
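
To get a feeling for what a diff between the actual and the expected text contains, here is a small sketch using Python's difflib. It only illustrates the idea of a word-level diff; how Lavd computes and renders its diff is not specified here, and `word_diff` is a hypothetical helper.

```python
import difflib
from typing import List, Tuple


def word_diff(actual: str, expected: str) -> List[Tuple[str, str, str]]:
    """Return (operation, actual part, expected part) for each word-level span."""
    actual_words = actual.split()
    expected_words = expected.split()
    matcher = difflib.SequenceMatcher(a=actual_words, b=expected_words)
    return [
        (tag, " ".join(actual_words[i1:i2]), " ".join(expected_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


for operation in word_diff("The quick brown fox", "The quick red fox"):
    print(operation)
```

Spans tagged equal are unchanged, while replace, delete and insert mark where the actual text deviates from the expected one.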

Markdown

Markdown documents are given as a Markdown string.

logger.log_markdown("# Hello\n\nMore markdown...", "some-markdown")

logger.log_markdown("# Step 1\n\n## Result\n\nGood", "for-step", step=1)

Command Line Options

If you use argparse to parse the command line options, the arguments that were used to start the experiment, as well as all available options of that script, can be saved.

import argparse

parser = argparse.ArgumentParser()
# Add all options with: parser.add_argument()
# ...
args = parser.parse_args()

logger.log_command(parser, args)

Summary

A summary of the current experiment, including the current git commit, the diff of the working tree and possibly more information, can be saved as a Markdown document. Two parts of the summary can be configured. Short infos are presented as an unordered list and are given as a dictionary, where the key is the label (dictionaries can be nested for nested lists). Additional sections are appended at the end of the document and are given as a dictionary, where the key is the section title and the value is either the raw string of the content or a list of lines.

# Regular summary
logger.log_summary()


# Summary with additional infos and sections
infos = {"size": {"train": 1000, "validation": 50}}
sections = {
    "Additional raw Information": "Some text\nand more",
    "Same but with Lines": ["Some text", "and more"],
}
logger.log_summary(infos, sections)
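
The mapping from a nested infos dictionary to a nested unordered list can be pictured with the following sketch. The function `infos_to_markdown` is hypothetical, and the exact formatting Lavd produces in the summary may differ; the sketch only shows how nested keys turn into nested list items.

```python
from typing import Any, Dict, List


def infos_to_markdown(infos: Dict[str, Any], depth: int = 0) -> List[str]:
    """Render a (possibly nested) dict as lines of a nested unordered list."""
    lines = []
    indent = "  " * depth
    for label, value in infos.items():
        if isinstance(value, dict):
            # A nested dict becomes a nested list below its label
            lines.append(f"{indent}- {label}")
            lines.extend(infos_to_markdown(value, depth + 1))
        else:
            lines.append(f"{indent}- {label}: {value}")
    return lines


print("\n".join(infos_to_markdown({"size": {"train": 1000, "validation": 50}})))
```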

Printing

The print functions act like print() combined with str.format(), and additionally write every message to a log file.

#               Message              ...Format arguments
logger.println("{}, message: {num}", "Hello", num=1)
# => Hello, message: 1

# Same but to STDERR instead of STDOUT
logger.eprintln("{}, message: {num}", "Hello", num=1)

Nicely formatted tables with print_table:

header = ["Name", "Correct", "Total", "Accuracy"]
rows = [
    ["Train", 978, 1000, 0.978],
    ["Validation", 90, 100, 0.9],
    ["Test", None, 50, None],
]
logger.print_table(header, rows)
# | Name       | Correct | Total | Accuracy |
# |------------|---------|-------|----------|
# | Train      | 978     | 1000  | 0.97800  |
# | Validation | 90      | 100   | 0.90000  |
# | Test       | -       | 50    | -        |

Spinner

A small wrapper around Halo that also logs the start and end of the task to a log file. It requires a name as the first argument, which is used in the log file and as the text next to the spinner, unless the text argument is supplied. All additional arguments are forwarded to Halo.

with logger.spinner("Loading..."):
    # Do something that takes a while
    pass

# Without context manager, but also some arguments for Halo
spinner = logger.spinner("Manual spinner", spinner="smiley", placement="right")
spinner.start()
# Do something
spinner.stop()

Progress Bar

A small wrapper around tqdm that also logs the start and end of the task to a log file. It requires a name as the first argument, which is used in the log file and as the text next to the progress bar, unless the desc argument is supplied. All additional arguments are forwarded to tqdm.

for d in logger.progress_bar("Train", data):
    # Training loop
    pass

# With context manager and manual update
with logger.progress_bar("Manual progress bar", total=100) as pbar:
    for _ in range(25):
        # Do something

        # Update the progress bar by 4 steps
        pbar.update(4)

Saving a PyTorch Model

PyTorch is optional, but must be installed to use this feature.

Saves the state dict of a PyTorch model and optionally its current gradients. If the model is a Multi-GPU/Node model (i.e. nn.DataParallel or nn.parallel.DistributedDataParallel) it will automatically be unwrapped to be compatible with the initial model.

import torch.nn as nn

# ResNet101 and gpu_id stand for your own model class and device index
model = ResNet101()
logger.save_model(model, step=3)

distributed_model = nn.parallel.DistributedDataParallel(
    model, device_ids=[gpu_id], find_unused_parameters=True
)
# Distributed model and also saving the gradients
logger.save_model(distributed_model, step=4, grads=True)
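
Unwrapping matters because nn.DataParallel and DistributedDataParallel keep the original model in their module attribute, and the state dict of the wrapper has all keys prefixed with "module.". The sketch below illustrates the idea without requiring PyTorch: `Wrapper` and `TinyModel` are stand-in classes and `unwrap_model` is a hypothetical helper, not Lavd's actual implementation.

```python
class Wrapper:
    """Stand-in for nn.DataParallel / DistributedDataParallel (illustration only)."""

    def __init__(self, module):
        # The PyTorch wrappers expose the original model as `.module`
        self.module = module


class TinyModel:
    """Stand-in for a regular nn.Module."""


def unwrap_model(model):
    """Return the wrapped model if there is one, otherwise the model itself."""
    # Real code would check against the actual PyTorch wrapper classes instead
    return model.module if isinstance(model, Wrapper) else model


model = TinyModel()
print(unwrap_model(Wrapper(model)) is model)
```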

Saving any Object

PyTorch is optional, but must be installed to use this feature.

Saves any object by serialising it with torch.save.

logger.save_obj({ "key": 99 }, "a-dict")

logger.save_obj((1, 2), "some_tuple", step=4)

Disabling the Logger

All logging actions can be disabled, either by creating a disabled logger or by disabling it later. The logging actions then become no-ops, and all methods can be used as with a regular logger, just without producing any output. Methods that produce some useful information without logging anything continue to function as usual.

Disabling the logger is particularly useful when the same script is launched in multiple processes, but only the main process should create the logs.

# Create a disabled logger
# No log directory will be created
logger = lavd.Logger("some-experiment-name", disabled=True)

# Does nothing
logger.println("hello")

# Won't show the spinner
with logger.spinner("Loading..."):
    # Do something that takes a while
    pass

# Create a normal logger
logger = lavd.Logger("some-experiment-name")

# Disable it later
logger.disable()

# Enable it (also works if the logger was initially disabled)
logger.enable()
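
For the multi-process case, one possible way to decide whether the logger should be disabled is to look at the rank of the process. This sketch assumes the launcher exposes the rank in the RANK environment variable, as PyTorch's torchrun does; other launchers may use a different variable. The helper `is_main_process` is hypothetical.

```python
import os
from typing import Mapping


def is_main_process(env: Mapping[str, str] = os.environ) -> bool:
    """True if this process has rank 0 (or no rank is set at all)."""
    return int(env.get("RANK", "0")) == 0


disabled = not is_main_process()
# Assumed usage, with lavd installed:
# logger = lavd.Logger("some-experiment-name", disabled=disabled)
print(f"Logger disabled: {disabled}")
```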

Data Layout

The server picks up any data in the specified log directory that can be displayed, but certain functionality requires a specific layout or file format (which remains human readable).

In the root of the log directory (i.e. the path given to lavd) the different experiments are listed by name. Each experiment is in its own directory. Any other file at the top level is ignored, since it cannot be associated with an experiment.

Within each experiment directory, there are numbered directories, each of which acts as a step (epoch) within the experiment, and all files within such a directory belong to that specific step. The names are usually padded with zeros, e.g. 0001, but that is not required; it only helps tools such as ls, which sort lexicographically, to show them in the correct order. Since 0001 == 01 == 1, multiple representations of the same number conflict with each other and should therefore be avoided.
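
Conflicting step directories can be detected before starting the server. The following is a minimal sketch that groups directory names by their numeric value; `find_step_conflicts` is a hypothetical helper, and treating every name made up of ASCII digits as a step is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_step_conflicts(dir_names: Iterable[str]) -> Dict[int, List[str]]:
    """Group numeric directory names by step and keep steps with more than one name."""
    by_step: Dict[int, List[str]] = defaultdict(list)
    for name in dir_names:
        # Assumption: only names consisting solely of ASCII digits are steps
        if name.isascii() and name.isdigit():
            by_step[int(name)].append(name)
    return {step: names for step, names in by_step.items() if len(names) > 1}


print(find_step_conflicts(["0001", "01", "2", "other"]))
```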

Every other file, including nested files, is used as data based on its file type, and the relative path (minus the extension) is used as the label. For example, some-experiment-name/0001/bird.png is an image belonging to the some-experiment-name experiment, at step 1 and with the label bird. The same label can hold data of multiple categories: if there is also a bird.md file, there is additionally a Markdown document with the label bird. The same base name can therefore be used for different files, as long as they belong to different categories. Conversely, merely using a different extension within the same category creates a conflict (e.g. birds.png and birds.jpeg) and should also be avoided.

Files not within a step directory are considered global to the experiment (including nested files). Once inside a nested directory, numbered directories are just regular directories, not steps. For example, some-experiment-name/other/0001/still-global.md is global to some-experiment-name with the label other/0001/still-global.

There is one exception to the rule above: commands. Only one command is used per experiment, and it is the commands.json file directly inside the experiment directory, i.e. some-experiment-name/commands.json. A command in any other file is not displayed (this applies only to the command; all other categories are still shown even if they are in a nested command.json).

JSON files are special in the sense that they can contain multiple kinds of data. They are separated within the JSON file by using the appropriate keys.

The following directory structure demonstrates the key ideas:

log/
├── some-experiment-name
│   ├── 0001                        # Step: 1
│   │   ├── bird.png                # Label: bird
│   │   ├── bird.json               # Label: bird
│   │   ├── bird.md                 # Label: bird
│   │   └── difficult
│   │       └── unlabelled
│   │           └── some image.png  # Label: difficult/unlabelled/some image
│   ├── 0002                        # Step: 2
│   │   └── bird.png
│   ├── command.json                # Only file that can contain command definitions
│   ├── events.log
│   ├── other                       # Nested global
│   │   ├── 0001                    # Not a step directory!
│   │   │   └── still-global.md     # Label: other/0001/still-global
│   │   └── instructions.md         # Label: other/instructions
│   └── summary.md
├── yet-another-experiment
│   └── 0007
│       └── bird.png
├── top-level-file.md               # ignored
│
...
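
The layout rules can be summarised in a small sketch that splits a path (relative to the log directory) into experiment, step and label. This is an illustration of the rules as described in this section, not the server's actual implementation; `Entry` and `parse_path` are hypothetical names, and the special handling of commands is left out.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    experiment: str
    step: Optional[int]  # None means global to the experiment
    label: str


def parse_path(relative_path: str) -> Optional[Entry]:
    """Split a path relative to the log directory into experiment, step and label."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 2:
        # Top-level files cannot be associated with an experiment
        return None
    experiment, rest = parts[0], parts[1:]
    step = None
    # Only a numbered directory directly inside the experiment is a step
    if len(rest) > 1 and rest[0].isdigit():
        step = int(rest[0])
        rest = rest[1:]
    # The label is the remaining path without the file extension
    label = str(PurePosixPath(*rest).with_suffix(""))
    return Entry(experiment, step, label)


print(parse_path("some-experiment-name/0001/bird.png"))
print(parse_path("some-experiment-name/other/0001/still-global.md"))
```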

Comparison to TensorBoard

TensorBoard is one of the most widely used tools for this purpose and is undoubtedly incredibly useful. Unfortunately, it also has its drawbacks, hence Lavd was created with a different set of goals. One of the biggest annoyances of TensorBoard is that everything is serialised, which means that you cannot inspect the data directly without deserialising it first. If a sufficiently large amount of data is logged, it takes a while to see the last event, because everything is deserialised in order. Additionally, something like images cannot simply be viewed with any other tool, since they are not saved as image files. TensorBoard also limits the number of steps that can be viewed for something like images or text, presumably to not blow up the serialised data, and you cannot see anything in between or even decide which ones you want to keep.

Lavd leverages the file system and the data can be freely viewed and edited just like any other file, while still being picked up for the visualisation. That also comes with its own drawbacks compared to serialising the data, for example using more disk space, although that is almost certainly a non-issue.

Another difference to TensorBoard is that Lavd does not revolve around TensorFlow graphs or tensors or Machine Learning in general, but focuses more on the actual data. Even though TensorBoard has been used with PyTorch for a while, the graphs are not really the focus there; it is mostly used to log scalars, text, images etc. While the graph visualisations are nice, they are not used all that often, because once the graph has been seen there is just no reason to create it again, yet there are countless experiments that use the exact same graph. This also means that Lavd does not try to cover everything that TensorBoard does. Of course, there are also features that TensorBoard does not cover.

Known Issues

Inotify watch limit reached

OSError: [Errno 28] inotify watch limit reached

The error occurs if too many files are being watched at the same time. You can check how many watches are allowed as follows:

cat /proc/sys/fs/inotify/max_user_watches
# => 8192

The default is 8192 on most systems. To solve the issue the limit needs to be increased:

sudo sysctl fs.inotify.max_user_watches=524288
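
To estimate whether a log directory is likely to hit the limit, the number of files and directories can be compared to the allowed watches. The following is a rough sketch: how many watches the server actually needs per file or directory is an assumption, so the count is only an upper-bound estimate. `read_watch_limit` and `count_entries` are hypothetical helpers, and the demo uses a temporary directory in place of a real log directory.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

LIMIT_FILE = Path("/proc/sys/fs/inotify/max_user_watches")


def read_watch_limit(limit_file: Path = LIMIT_FILE) -> Optional[int]:
    """Return the inotify watch limit, or None where it is unavailable (e.g. not Linux)."""
    try:
        return int(limit_file.read_text().strip())
    except (OSError, ValueError):
        return None


def count_entries(root: str) -> int:
    """Count all directories and files below root (root itself included)."""
    total = 1
    for _, dirs, files in os.walk(root):
        total += len(dirs) + len(files)
    return total


# Demo on a temporary directory standing in for a log directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "0001").mkdir()
    Path(tmp, "0001", "bird.md").write_text("# Bird")
    demo_count = count_entries(tmp)

print(f"{demo_count} entries, watch limit: {read_watch_limit()}")
```

If the number of entries in the real log directory comes close to the limit, increasing the limit as shown above is advisable.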