Skip to content

Latest commit

 

History

History
281 lines (204 loc) · 10.9 KB

specification.md

File metadata and controls

281 lines (204 loc) · 10.9 KB

Version 1.1 Final

ua-parser2 Specification

This document describes the specification on how a parser should implement the regexes.yaml file for correctly parsing User-Agent strings on basis of that file.

This specification shall help maintainers and contributors to correctly use the provided information within the regexes.yaml file for obtaining information from the different User-Agent strings. Furthermore this specification shall be the basis for discussions on evolving the projects and the needed parsing algorithms.

This document will not provide any information on how to implement the ua-parser2 project on your server and how to retrieve the User-Agent string for further processing.

Table of Contents

regexes.yaml

Any information which can be obtained from a User-Agent string may contain information on:

  • User-Agent aka "the browser"
  • Engine the Rendering Engine used by the User-Agent
  • OS (Operating System) the User-Agent currently uses (or runs on)
  • Device information by means of the physical device the User-Agent is using

This information is provided within the regexes.yaml file. Each kind of information requires a different parser which extracts the related type. These are:

  • user_agent_parsers
  • engine_parsers
  • os_parsers
  • device_parsers

Each parser contains a list of regular-expressions which are named regex. For each regex replacements specific to the parser can be named to attribute or change information. A replacement may require a match from the regular-expression which is extracted by an expression enclosed in normal brackets (.*). Each match can be addressed with $1 to $999 and used in a parser specific replacement.

General Rules for writing/changing Regular Expressions

Escape the following characters

  • . => \.
  • ( => \( ; ) => \) ; otherwise a regex group will be opened
  • / does not need to be escaped but can be escaped using \/
  • - used within Bracketed Character classes needs to be escaped. E.g. [ _\-]+

For further details on Regular Expressions check perldoc.

Rules for Replacements

To use matches within replacements use either the form of $\d+, e.g. $1 ... $999.
In special cases where you want to use match $1 followed by a 0 and not $10, use the form ${1}0

Example

- regex: '(Minefield)/(\d+)\.(\d+)\.(\d+(?:pre)?)'
  family: 'Firefox ($1)'
  v1: '$3'
  v2: '$4'
  patch:
  type: 'browser::Firefox::$1'

A User-Agent String Minefield/2.1.0pre using the regex above would be evaluated to:

family: 'Firefox (Minefield)'
major: '1'
minor: '0pre'
patch: 
type: 'browser::Firefox::Minefield'

Grouping regex'es

To speed up parsing as well as to build isolated groups of regular expressions, the group parameter is used. It shall contain a regex any may use regex_flag: 'i' for case-insensitive matching. The group is only entered, if the associated regex matches. Otherwise evaluation continues with the next following regex or group.

Example

#>> group matching '/gecko/i' starts
- group:
  regex: 'gecko'
  regex_flag: 'i'
  # can only be reached if '/gecko/i' did initially match
  - regex: '(Firefox)/(\d+)\.(\d+)\.(\d+(?:pre)?)'
  #<< group matching '/gecko/i' ends

- regex: '(Chrome)/(\d+)\.(\d+)'

In this case the group gecko is only entered if the User-Agent-String contains /gecko/i. regex: Firefox matches only if the User-Agent-String contains /gecko/i as well.

## user_agent_parsers

The user_agent_parsers shall return information of the family type of the User-Agent. If available the version infomation specifying the family may be extracted as well. Here major, minor and patch version information can be addressed or overwritten. A match is a regex-group enclosed in normal brackets.

match in regex default replacement note
1 family specifies the User-Agents family
2 v1 major version number/info of the family
3 v2 minor version number/info of the family
4 v3 patch version number/info of the family
- type may further describe the type of User-Agent, e.g. 'bot', 'app'

In case that no replacement is specified, the association is given by order of the match. If in the regex no first match (within normal brackets) is given, the family shall be specified!

To overwrite the respective value the replacement value needs to be named for a regex-item.

The type is only set in the output if a value is present for the respective regex. If you want to define "subgroups" use :: as seperator characters. <type>::<subtype>::<subsubtype>.

An infinite number of regex-groups shall be possible for each regex.

If regex_flag: 'i' is present then regex shall be evaluated as case-insensitive.

Parser Implementation:

The list of regular-expressions regex shall be evaluated for a given User-Agent string beginning with the first regex-item in the list to the last item. The first matching regex stops processing the list. Regex-matching shall be case sensitive.

In case that no replacement for a match is specified for a regex-item, the first match defines the family, the second major, the third minorand the forth patch information.

As placeholder for inserting matched characters $1 to $999 can be used to insert the matched characters from the regex into the replacement string.

If no matching regex is found the value for family shall be "Other". The version information major, minor and patch shall not be defined.

Example:

For the User-Agent: Mozilla/5.0 (Windows; Windows NT 5.1; rv:2.0b3pre) Gecko/20100727 Minefield/4.0.1pre the matching regex:

- regex: '(Namoroka|Shiretoko|Minefield)/(\d+)\.(\d+)\.(\d+(?:pre)?)'
  family: 'Firefox ($1)'

shall be resolved to:

family: Firefox (Minefield)
major : 4
minor : 0
patch : 1pre

The engine_parsers shall return information of the family type of the Engine the User-Agent uses. If available the version infomation specifying the family may be extracted as well. Here major, minor and patch version information can be addressed or overwritten.

match in regex default replacement note
1 family specifies the User-Agents Engine family
2 v1 major version number/info of the family
3 v2 minor version number/info of the family
4 v3 patch version number/info of the family
- type may further describe the type of Engine e.g. 'mode::MSIE 9'

The parser implemention shall be identical to the one described in user_agent_parsers.

os_parsers

The os_parsers shall return information of the os type of the Operating System (OS) the User-Agent runs. If available the version information specifying the os may be extracted as well if available. Here major, minor, patch and patchMinor version information can be addressed or overwritten.

match in regex default replacement note
1 family specifies the OS
2 v1 major version number/info of OS
3 v2 minor version number/info of the OS
4 v3 patch version number/info of the OS
5 v4 patchMinor version number/info of the OS
- type may further describe the type of OS

The parser implemention shall be identical to the one described in user_agent_parsers. Aditionally the patchMinor value shall always be added to the resulting object.

device_parsers

The device_parsers shall return information of the device family the User-Agent runs. Furthermore brand and model of the device can be specified. brand shall name the manufacturer or brand of the device, where model shall specify the model of the device.

match in regex default replacement note
1 family specifies the device family
- brand major version number/info of OS
1 model minor version number/info of the OS
- type may further describe the type of Device

In case that no replacement is specified the association is given by order of the match. If in the regex no first match (within normal brackets) is given the family together with the model shall be specified! To overwrite the respective value the replacement value needs to be named for a given regex.

The type is only set in the output if a value is present for the respective regex. If you want to define "subgroups" use :: as seperator characters. <type>::<subtype>::<subsubtype>.

An infinite number of regex-groups shall be possible for each regex.

If regex_flag: 'i' is present then regex shall be evaluated as case-insensitive.

Parser Implementation:

The list of regular-expressions regex shall be evaluated for a given User-Agent string beginning with the first regex-item in the list to the last item. The first matching regex stops processing the list. Regex-matching shall be case sensitive.

In case that no replacement for a match is given, the first match defines the family and the model.

As placeholder for inserting matched characters $1 to $999 can be used to insert the matched characters from the regex into the replacement string.

In case that no matching regex is found the value for family shall be "Other". brand and model shall not be defined.

Example:

For the User-Agent: Mozilla/5.0 (Linux; U; Android 4.2.2; de-de; PEDI_PLUS_W Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30 the matching regex:

- regex: '; (PEDI)_(PLUS)_(W) Build/'
  device: '$1_$2_$3'
  brand: 'Odys'
  model: '$1 $2 $3'

shall be resolved to:

family: 'PEDI_PLUS_W' 
brand: 'Odys'
model: 'PEDI PLUS W'

Output format

The current output format returns an object with the following structure:

{
  ua: { // result of the user_agent_parsers
    family: 
    major:
    minor:
    patch:
    type: // optional
  },
  engine: { // result of the engine_parsers
    family: 
    major:
    minor:
    patch:
    type: // optional
  }
  os: { // result of the os_parsers
    family: 
    major:
    minor:
    patch:
    patchMinor:
    type: // optional
  }
  device: { // result of the device_parsers
    family: 
    brand:
    model:
    type: // optional
  }
}

End of Document