feat: backreferences, named capture groups, named backreferences #71

Closed
2 changes: 1 addition & 1 deletion babel.config.js
@@ -1,3 +1,3 @@
module.exports = {
-  presets: ['@babel/preset-env', '@babel/preset-typescript'],
+  presets: [['@babel/preset-env', { targets: { node: '10.0' } }], '@babel/preset-typescript'],
};
35 changes: 35 additions & 0 deletions docs/API.md
@@ -74,6 +74,41 @@ Captures, also known as capturing groups, extract and store parts of the matched
> [!NOTE]
> TS Regex Builder does not have a construct for non-capturing groups. Such groups are implicitly added when required. E.g., `zeroOrMore(["abc"])` is encoded as `(?:abc)*`.

### `backreference()`
Review comment (Member):

Using ordinal backreferences accurately can be problematic in more complex expressions, with nesting, etc. Therefore, I think we can drop them without losing any functionality for users trying to build maintainable regexes.
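
To illustrate the concern, here is a minimal sketch using plain JavaScript `RegExp` literals rather than any TS Regex Builder API. The patterns are made up for this example. It relies only on standard group numbering (groups are numbered by the position of their opening parenthesis) and shows how adding one group in front silently retargets an ordinal backreference.

```typescript
// Plain RegExp sketch; the patterns are illustrative and not taken from the PR.

// `\1` refers to the only group, so the pattern matches a doubled "a".
const original = /^(a)\1$/;

// A new group is added in front. `(a)` is now group 2, but `\1` was left as is,
// so it now repeats "x" instead of "a".
const afterRefactor = /^(x)(a)\1$/;

// The backreference has to be renumbered by hand to keep the old meaning.
const renumbered = /^(x)(a)\2$/;

console.log(original.test('aa')); // true
console.log(afterRefactor.test('xaa')); // false
console.log(afterRefactor.test('xax')); // true
console.log(renumbered.test('xaa')); // true
```

With builder-style composition, groups are often added or moved far away from the place where the reference is written, which is why a name tends to be easier to maintain than a number.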


```ts
function backreference(
groupNumber: number
): Backreference
```

Regex syntax: `\1`, `\2`, etc.

A backreference is a way to match the same text as previously matched by a capturing group.
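
As a rough illustration of that definition, the sketch below uses a plain `RegExp` literal instead of the `backreference()` construct, so it makes no assumptions about the builder API. The pattern and inputs are invented for the example.

```typescript
// `(a|b)` captures one letter and `\1` requires that same letter again.
const doubledLetter = /^(a|b)\1$/;

const inputs = ['aa', 'bb', 'ab'];
const matches = inputs.filter((input) => doubledLetter.test(input));

console.log(matches); // ['aa', 'bb']
```

Note that `\1` repeats the text that was captured, not the sub-pattern: `ab` is rejected even though `b` would satisfy `(a|b)` on its own.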

### `namedCapture()`
Review comment (Member):

I've considered having a separate namedCapture() construct in addition to the basic capture(). After some prototyping and consulting, I've found capture(..., { name: 'aaa' }) to be the better option, because it improves discoverability and follows the JS convention of "config" or "options" objects.
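
The options-object shape described in this comment could look roughly like the sketch below. This is a hypothetical illustration, not the library's implementation: `captureSketch` and `CaptureOptions` are made-up names, and the helper works on raw pattern strings only to keep the example self-contained.

```typescript
// Hypothetical sketch of the options-object design; it is not the library's implementation.
interface CaptureOptions {
  name?: string;
}

// Takes an already-encoded pattern string to keep the example small.
function captureSketch(pattern: string, options: CaptureOptions = {}): string {
  return options.name === undefined ? `(${pattern})` : `(?<${options.name}>${pattern})`;
}

const unnamed = captureSketch('\\d{4}');
const named = captureSketch('\\d{4}', { name: 'year' });
const yearMatch = new RegExp(named).exec('2021');

console.log(unnamed); // (\d{4})
console.log(named); // (?<year>\d{4})
console.log(yearMatch?.groups?.year); // 2021
```

One function covers both cases, and the property name at the call site makes it clear which argument is the group name.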


```ts
function namedCapture(
  sequence: RegexSequence,
  name: string
): NamedCapture
```

Regex syntax: `(?<name>...)`.

A named capturing group is a capturing group that has a name. The group's match can later be retrieved by this name.
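
For reference, this is how named groups behave in a plain JavaScript `RegExp`, independent of the `namedCapture()` construct proposed here. The pattern and input are invented for the example, and the pattern is built from a string so the sketch does not depend on compiler checks for regex literals.

```typescript
// Built from a string so the sketch does not depend on the compiler's regex-literal target checks.
const datePattern = new RegExp('^(?<year>\\d{4})-(?<month>\\d{2})$');
const dateMatch = datePattern.exec('2021-08');

// Matched text is looked up by group name.
const year = dateMatch?.groups?.year;
const month = dateMatch?.groups?.month;

// The same text is still available by position.
const yearByIndex = dateMatch?.[1];

console.log(year, month, yearByIndex); // 2021 08 2021
```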

### `namedBackreference()`
Review comment (Member):

Swift's Regex Builder uses the following convention for named captures and references:

let kind = Reference(Substring.self)

let regex = Capture(as: kind) {
  ChoiceOf {
    "CREDIT"
    "DEBIT"
  }
}

see: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md#reference

It has the nice feature of connecting the reference straight to the capturing group, instead of forcing the user to repeat the name twice: once in the capture and again in the backreference.

However, in that case dropping the "back" prefix seems beneficial, as a reference becomes a "backreference" only when it is added to a regular expression. Until it is applied to an earlier part of the expression ("back"), it is more of a plain reference.
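
A TypeScript rendering of that idea might look like the sketch below. Everything in it is hypothetical: `createReference`, `captureAs`, and `matchAgain` are made-up names that exist neither in this PR nor in the library, and the helpers emit raw pattern strings only to keep the example runnable on its own.

```typescript
// Hypothetical sketch of the reference-object idea; none of these names come from the PR or the library.
interface Reference {
  readonly name: string;
}

function createReference(name: string): Reference {
  return { name };
}

// Encodes a named capturing group whose name comes from the reference object.
function captureAs(ref: Reference, pattern: string): string {
  return `(?<${ref.name}>${pattern})`;
}

// Encodes a named backreference to the group tied to the same reference object.
function matchAgain(ref: Reference): string {
  return `\\k<${ref.name}>`;
}

const kind = createReference('kind');
const source = `^${captureAs(kind, 'CREDIT|DEBIT')}-${matchAgain(kind)}$`;
const kindRegex = new RegExp(source);

console.log(source); // ^(?<kind>CREDIT|DEBIT)-\k<kind>$
console.log(kindRegex.test('DEBIT-DEBIT')); // true
console.log(kindRegex.test('DEBIT-CREDIT')); // false
```

The group name is written once, when the reference is created, so the capture and the later reference cannot drift apart through a typo.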


```ts
function namedBackreference(
groupName: string
): NamedBackreference
```

Regex syntax: `\k<groupName>`.

A named backreference is a way to match the same text as previously matched by a named capturing group.
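
For comparison, here is the equivalent behavior in a plain JavaScript `RegExp`, which writes a named backreference as `\k<name>`. The sketch does not use the `namedBackreference()` construct, and the pattern and inputs are invented for the example.

```typescript
// `(?<word>\w+)` captures a word and `\k<word>` requires the same word again after a space.
const repeatedWord = new RegExp('^(?<word>\\w+) \\k<word>$');

const sameWord = repeatedWord.test('hello hello');
const differentWord = repeatedWord.test('hello world');

console.log(sameWord); // true
console.log(differentWord); // false
```
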
### `lookahead()`

```ts
18 changes: 18 additions & 0 deletions src/__tests__/example-date.ts
@@ -0,0 +1,18 @@
import { buildRegExp, digit, endOfString, namedCapture, repeat, startOfString } from '..';

// Example: dateRegex
const dateRegex = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/i;
Review comment (Member):

Nice one, I'll add this example.

const yearRegex = namedCapture(repeat(digit, 4), 'year');
const monthRegex = namedCapture(repeat(digit, 2), 'month');
const dayRegex = namedCapture(repeat(digit, 2), 'day');
const regex = buildRegExp([startOfString, yearRegex, '-', monthRegex, '-', dayRegex, endOfString], {
ignoreCase: true,
});

test('dateRegex', () => {
expect(dateRegex).toEqual(regex);
});

test('dateRegex matching', () => {
expect(dateRegex).toMatchGroups('2021-08-24', ['2021-08-24', '2021', '08', '24']);
});
90 changes: 90 additions & 0 deletions src/__tests__/example-email-advanced.ts
@@ -0,0 +1,90 @@
import {
anyOf,
buildRegExp,
charClass,
charRange,
digit,
endOfString,
namedCapture,
oneOrMore,
repeat,
startOfString,
} from '..';

//
// Example: email validation building blocks
//
const upperCase = charRange('A', 'Z');
const lowerCase = charRange('a', 'z');
const specialChars = anyOf("!#$%&'*+/=?^_`{|}~-");
const usernameChars = charClass(upperCase, lowerCase, digit, specialChars);
const hostnameChars = charClass(upperCase, lowerCase, digit, specialChars);
const domainChars = charRange('a', 'z');
const emailSeparator = anyOf('.');
const domainSeparator = anyOf('@');

//
// Example: email validation major components using named capture.
//
const username = namedCapture(oneOrMore(usernameChars), 'username');

const usernameRegex = buildRegExp([startOfString, username, endOfString]);

test('Matching the Username component.', () => {
Review comment (Member, @mdjastrzebski, Mar 13, 2024):

I'll merge this with the existing email example, as they're quite similar.

Note: I am planning to add some frequently used patterns (URL, email, maybe hashtags, etc.) so that users do not have to define them by hand. I will spec out this feature soon, and I invite you to join in if you have the capacity to work on it.

expect(usernameRegex).toMatchString('john1234');
expect(usernameRegex).toMatchString('ringo$1234');
expect(usernameRegex).not.toMatchString('john@1234');
expect(usernameRegex).not.toMatchString('george.harrison');
expect(usernameRegex).not.toMatchString('paul.mccartney&wings');
expect(usernameRegex).not.toMatchString('ringo starr');
});

const hostname = namedCapture(oneOrMore(hostnameChars), 'hostname');

const hostnameRegex = buildRegExp([startOfString, hostname, endOfString]);

test('Matching the Hostname component.', () => {
expect(hostnameRegex).toMatchString('gmail');
expect(hostnameRegex).toMatchString('google');
expect(hostnameRegex).toMatchString('g-mail');
expect(hostnameRegex).toMatchString('g_mail');
expect(hostnameRegex).not.toMatchString('g mail');
expect(hostnameRegex).not.toMatchString('g.mail');
});

const domain = namedCapture(repeat(domainChars, { min: 2 }), 'domain');

const domainRegex = buildRegExp([startOfString, domain, endOfString]);

test('Matching the Domain component.', () => {
expect(domainRegex).toMatchString('com');
expect(domainRegex).toMatchString('org');
expect(domainRegex).not.toMatchString('c');
expect(domainRegex).not.toMatchString('o');
expect(domainRegex).toMatchString('co');
});

test('example: email validation', () => {
const regex = buildRegExp(
[startOfString, username, domainSeparator, hostname, emailSeparator, domain, endOfString],
{ ignoreCase: true },
);

expect(regex).toMatchString('[email protected]');
expect(regex).toMatchString('[email protected]');
expect(regex).toMatchString('[email protected]');
expect(regex).not.toMatchString('[email protected]');

expect(regex).not.toMatchString('@');
expect(regex).not.toMatchString('aaa@');
expect(regex).not.toMatchString('[email protected]');
expect(regex).not.toMatchString('@gmail.com');

const emailAddress = 'abba@gold.com';
const match = regex.exec(emailAddress);
expect(match).not.toBeNull();
expect(match?.groups).not.toBeNull();
expect(match?.groups?.username).toBe('abba');
expect(match?.groups?.hostname).toBe('gold');
expect(match?.groups?.domain).toBe('com');
});
8 changes: 4 additions & 4 deletions src/__tests__/example-url-advanced.ts
@@ -22,8 +22,8 @@ const uppercase = charRange('A', 'Z');
const hyphen = anyOf('-');
const alphabetical = charClass(lowercase, uppercase);
const specialChars = anyOf('._%+-');
-const portSeperator = ':';
-const schemeSeperator = ':';
+const portSeparator = ':';
+const schemeSeparator = ':';
const doubleSlash = '//';
const at = '@';
const pathSeparator = '/';
@@ -63,7 +63,7 @@ const userInfo = oneOrMore(usernameChars);
const hostname = repeat(hostnameChars, { min: 1, max: 63 });
const hostnameEnd = capture([hostname, endOfString]);
const host = capture([oneOrMore([hostname, '.'])]);
-const port = [portSeperator, oneOrMore(digit)];
+const port = [portSeparator, oneOrMore(digit)];

const authority = [doubleSlash, optional([userInfo, at]), hostname, optional(port)];
const authorityRegex = buildRegExp([startOfString, ...authority, endOfString], {
@@ -162,7 +162,7 @@ const urlRegex = buildRegExp(
startOfString,
capture([
optional(scheme),
-schemeSeperator,
+schemeSeparator,
optional(authority),
path,
optional(query),
75 changes: 75 additions & 0 deletions src/constructs/__tests__/backreference.test.tsx
@@ -0,0 +1,75 @@
import { backreference, buildRegExp, capture } from '../..';

describe('backreference function', () => {
it('should create a backreference to a previously captured group', () => {
const group = capture('a');
const backRef = backreference(1);
const groupRegex = buildRegExp([group, backRef]);

const match = groupRegex.exec('aa');
expect(match).not.toBeNull();
expect(match?.[0]).toBe('aa');
});

it('should not match when the backreference does not match the captured group', () => {
const group = capture('a');
const backRef = backreference(1);
const groupRegex = buildRegExp([group, backRef]);

const match = groupRegex.exec('a\\1');
expect(match).toBeNull();
});

it('should not match when multiple backreferences are not satisfied by the input', () => {
const group1 = capture('a');
const group2 = capture('b');
const backRef1 = backreference(1);
const backRef2 = backreference(2);
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]);
const match = groupRegex.exec('aabb');
expect(match).toBeNull();
});

it('should handle multiple valid backreferences', () => {
const group1 = capture('a');
const group2 = capture('b');
const backRef1 = backreference(1);
const backRef2 = backreference(2);
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]);

// The pattern is (a)(b)\1\2, so a valid input repeats "a" and then "b" after the groups.
const match = groupRegex.exec('abab');
expect(match).not.toBeNull();
expect(match?.[0]).toBe('abab');
});
});

it('should handle backreferences in different order', () => {
const group1 = capture('a');
const group2 = capture('b');
const backRef1 = backreference(2);
const backRef2 = backreference(1);
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]);

const match = groupRegex.exec('abba');
expect(match).not.toBeNull();
expect(match?.[0]).toBe('abba');
});

it('should not match when the backreference does not match the captured group', () => {
const group = capture('a');
const backRef = backreference(1);
const groupRegex = buildRegExp([group, backRef]);

const match = groupRegex.exec('abba');
expect(match).toBeNull();
});

it('should handle multiple backreferences to the same group', () => {
const group1 = capture('a');
const backRef1 = backreference(1);
const backRef2 = backreference(1);
const groupRegex = buildRegExp([group1, backRef1, backRef2]);

const match = groupRegex.exec('aaa');
expect(match).not.toBeNull();
expect(match?.[0]).toBe('aaa');
});
43 changes: 43 additions & 0 deletions src/constructs/__tests__/named-backreference.test.tsx
@@ -0,0 +1,43 @@
import { buildRegExp, namedBackreference, namedCapture } from '../..';

describe('named-backreference function', () => {
it('should create a backreference to a previously captured group', () => {
const group = namedCapture('a', 'groupA');
const groupRef = namedBackreference('groupA');
const groupRegex = buildRegExp([group, groupRef]);

const match = groupRegex.exec('aa');
expect(match).not.toBeNull();
expect(match?.[0]).toBe('aa');
});

it('should not match when the backreference does not match the captured group', () => {
const group = namedCapture('a', 'groupA');
const groupRef = namedBackreference('groupA');
const groupRegex = buildRegExp([group, groupRef]);

const match = groupRegex.exec('a\\1');
expect(match).toBeNull();
});

it('should not match when multiple backreferences are not satisfied by the input', () => {
const group1 = namedCapture('a', 'groupA');
const group2 = namedCapture('b', 'groupB');
const groupARef = namedBackreference('groupA');
const groupBRef = namedBackreference('groupB');
const groupRegex = buildRegExp([group1, group2, groupARef, groupBRef]);

const match = groupRegex.exec('aabb');
expect(match).toBeNull();
});

it('should handle multiple valid backreferences', () => {
const group1 = namedCapture('ab', 'groupA');
const group2 = namedCapture('ba', 'groupB');
const groupARef = namedBackreference('groupA');
const groupBRef = namedBackreference('groupB');
const groupRegex = buildRegExp([group1, group2, groupARef, groupBRef]);
const match = groupRegex.exec('abbaabba');
expect(match).not.toBeNull();
});
});
49 changes: 49 additions & 0 deletions src/constructs/__tests__/named-capture.test.tsx
@@ -0,0 +1,49 @@
import { buildRegExp, namedCapture, oneOrMore } from '../..';

describe('namedCapture function', () => {
it('should create a named capture group', () => {
const regex = buildRegExp(namedCapture('a', 'group1'));
const match = regex.exec('a');
expect(match?.groups?.group1).toBe('a');
});

it('should not match when the named capture group does not match the input', () => {
const regex = buildRegExp(namedCapture('a', 'group1'));
const match = regex.exec('b');
expect(match).toBeNull();
});

it('should handle multiple named capture groups', () => {
// Two sibling groups; the nested case is covered by the next test.
const regex = buildRegExp([namedCapture('a', 'group1'), namedCapture('b', 'group2')]);
const match = regex.exec('ab');
expect(match).not.toBeNull();
expect(match?.groups?.group1).toBe('a');
expect(match?.groups?.group2).toBe('b');
});

it('should handle nested named capture groups', () => {
const regex = buildRegExp(namedCapture(['a', namedCapture('b', 'group2')], 'group1'));
const match = regex.exec('ab');
expect(match?.groups?.group1).toBe('ab');
expect(match?.groups?.group2).toBe('b');
});
});

describe('namedCapture RegEx matching', () => {
test('`named-capture` pattern', () => {
expect(namedCapture('a', 'abba')).toEqualRegex(/(?<abba>a)/);
Review comment (Member):

When seeing namedCapture('a', 'abba'), it's hard to tell which argument is the pattern to match and which is the name of the capturing group.

expect(namedCapture('abc', 'abc')).toEqualRegex(/(?<abc>abc)/);
expect(namedCapture(oneOrMore('abc'), 'ababab')).toEqualRegex(/(?<ababab>(?:abc)+)/);
expect(oneOrMore(namedCapture('abc', 'abacab'))).toEqualRegex(/(?<abacab>abc)+/);
});

test('`named-capture` matching', () => {
expect(namedCapture('b', 'b')).toMatchGroups('ab', ['b', 'b']);
expect(['a', namedCapture('b', 'b')]).toMatchGroups('ab', ['ab', 'b']);
expect(['a', namedCapture('b', 'b'), namedCapture('c', 'c')]).toMatchGroups('abc', [
'abc',
'b',
'c',
]);
});
});
22 changes: 22 additions & 0 deletions src/constructs/backreference.ts
@@ -0,0 +1,22 @@
import type { EncodeResult } from '../encoder/types';
import type { GroupNumber, RegexConstruct } from '../types';

export interface Backreference extends RegexConstruct {
  type: 'backreference';
  group: GroupNumber;
}

export function backreference(groupNumber: GroupNumber): Backreference {
  return {
    type: 'backreference',
    group: groupNumber,
    encode: encodeBackreference,
  };
}

// Encodes the construct as an ordinal backreference, e.g. `\1`.
function encodeBackreference(this: Backreference): EncodeResult {
  return {
    precedence: 'atom',
    pattern: `\\${this.group}`,
  };
}