Have a ConsolePolicy which

Possible fields for a log message:

LEVEL: info, warn, error
MESSAGE:
PREFIX:
RAW_ERROR: Only see this if LEVEL=error
SUBJECT: publish, publisherror

Inplace message filter

Header

[ ] Check for header constraints in-place to avoid memory allocation.
[ ] Possible constraints
- Timestamp
- Node name
- Pool name

How to make thing fast?

Run the first time to get the size of lines.
Second run will be used to filter messages?

File

size: Size of a file.
lines: Position of EOL.
data: A file content.

FileStats

path: Full path
stem: Stem
extension: File extension
mask: Read/Write/Execute etc
symlink: Is a symlink or not
is_dir: Is a folder or not

Tasks

Loop thought all log files and find all possible field name.

Extract JSON string.
Parse JSON string and update possible key list.

Data structure for scribe log

LogMessage

MessageHeader

MessageBody

Multi-threaded log parser.

Producer

Read trunk of data
Detect the last EOL and pass data to the consumer. Might need to push data into a queue?

Consumers

Pop data off queue.

Parse data and keep informat that we need.
Update the results (shared data)

Store data

Store data into NoSQL database.
Column: date_hour
Key: Message IDs.
Need to have a cache information for message ids.

Check message life cycle

Parse incomming messages.
Update message lifecycle table. A vector + a hash table.
Print out summary information.
Print out outlier.

What we can do to speedup fastgrep?

Parsing algorithm: Can we parse the read buffer instead of copying to the line buffer?

Handle cases that a line can be on many block.

Scan log file to cache message information such as

begin and end of a message.
Type of messages?
Timestamp

How do we cache messages

publish request.

Can be varied from 131 to several MBytes.

Control messages

We only store the message id (20 bytes) and message type (1 bytes)

Error messages

User error
RAW_ERROR: connect and publish errors etc.

Use cases

Support skip patterns

Filter message by time stamp.

Save found message to files in different format.

Check message life cycle.

Clustering messages

Questions

Can mixed-in offer better performance than policy based approach?