feat: backreferences, named capture groups, named backreferences #71
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
module.exports = { | ||
- presets: ['@babel/preset-env', '@babel/preset-typescript'], | ||
+ presets: [['@babel/preset-env', { targets: { node: '10.0' } }], '@babel/preset-typescript'], | ||
}; |
|
@@ -74,6 +74,41 @@ Captures, also known as capturing groups, extract and store parts of the matched | |
> [!NOTE] | ||
> TS Regex Builder does not have a construct for non-capturing groups. Such groups are implicitly added when required. E.g., `zeroOrMore(["abc"])` is encoded as `(?:abc)*`. | ||
|
||
### `backreference()` | ||
|
||
```ts | ||
function backreference( | ||
groupNumber: number | ||
): Backreference | ||
``` | ||
|
||
Regex syntax: `\1`, `\2`, etc. | ||
|
||
A backreference matches the same text that was previously matched by a capturing group, identified by its 1-based group number. | ||
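
To illustrate the regex syntax that `backreference()` is documented as producing, here is a minimal sketch that uses the built-in `RegExp` rather than the builder API. The pattern and variable names are illustrative assumptions, not taken from the library.

```ts
// Plain-RegExp illustration of a numbered backreference (not the builder API).
// `(\w+)` captures a word as group 1; `\1` must then match that same text again.
const repeatedWord = /\b(\w+) \1\b/;

const hit = repeatedWord.exec('this is is a test');
console.log(hit?.[0]); // "is is"
console.log(hit?.[1]); // "is"

// No word is immediately repeated here, so the backreference cannot match.
console.log(repeatedWord.test('this is a test')); // false
```

The builder is expected to emit the same `\1` form for `backreference(1)`, so the matching behaviour should be the same.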
|
||
### `namedCapture()` | ||
Review comment: I've considered the option of having a separate [...]
||
|
||
```ts | ||
function namedCapture( | ||
sequence: RegexSequence, | ||
name: string | ||
): NamedCapture | ||
``` | ||
|
||
Regex syntax: `(?<name>...)`. | ||
|
||
A named capturing group is a capturing group that assigns a name to the group. The group's match can later be retrieved by this name. | ||
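
As a rough illustration of the `(?<name>...)` syntax, the sketch below uses the built-in `RegExp` instead of the builder API. The date pattern mirrors the one in this PR's date example; the variable names are assumptions.

```ts
// Plain-RegExp illustration of named capturing groups (not the builder API).
const isoDate = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/;

const dateMatch = isoDate.exec('2021-08-24');

// Named groups are exposed on `groups`, and remain available by index.
console.log(dateMatch?.groups?.year);  // "2021"
console.log(dateMatch?.groups?.month); // "08"
console.log(dateMatch?.[3]);           // "24"
```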
|
||
### `namedBackreference()` | ||
Review comment: Swift Regex Builder uses the following convention for named captures/references:
    let kind = Reference(Substring.self)
    let regex = Capture(as: kind) {
      ChoiceOf {
        "CREDIT"
        "DEBIT"
      }
    }
See: https://github.com/apple/swift-evolution/blob/main/proposals/0351-regex-builder.md#reference
It has the nice feature of connecting the reference straight to the capturing group, instead of forcing the user to repeat the name twice, once in [...] However, in such a case, dropping the "back" prefix seems beneficial, as a reference becomes a "backreference" only when added to a regular expression. Until it is applied to a previous part of the expression ("back"), it is more of a plain reference.
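
The idea in this comment, a reference object that is created once and then bound to a capture, could look roughly like the following in TypeScript. Every name here (`Reference`, `reference`, `captureAs`, `refer`) is hypothetical and is not part of TS Regex Builder. Patterns are passed as raw regex source strings only to keep the sketch self-contained.

```ts
// Hypothetical sketch: none of these names exist in TS Regex Builder.
interface Reference {
  readonly name: string;
}

let nextRefId = 1;

// Creates a reference with an explicit or auto-generated group name.
function reference(name: string = `ref${nextRefId++}`): Reference {
  return { name };
}

// Encodes a named capturing group bound to the reference.
function captureAs(ref: Reference, pattern: string): string {
  return `(?<${ref.name}>${pattern})`;
}

// Encodes a use of the reference; it becomes a *back*reference only here.
function refer(ref: Reference): string {
  return `\\k<${ref.name}>`;
}

const kind = reference('kind');
const source = captureAs(kind, 'CREDIT|DEBIT') + '-' + refer(kind);
const kindRegex = new RegExp(`^${source}$`);

console.log(source);                         // (?<kind>CREDIT|DEBIT)-\k<kind>
console.log(kindRegex.test('DEBIT-DEBIT'));  // true
console.log(kindRegex.test('DEBIT-CREDIT')); // false
```

With this shape the group name is written once, when the reference is created, and the capture and its later use both take the same object.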
||
|
||
```ts | ||
function namedBackreference( | ||
groupName: string | ||
): NamedBackreference | ||
``` | ||
|
||
Regex syntax: `\k<groupName>`. | ||
|
||
A named backreference is a way to match the same text as previously matched by a named capturing group. | ||
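
For the named form, a minimal sketch using the built-in `RegExp` (again, not the builder API) could look like this. The quote-matching pattern and all names are illustrative assumptions.

```ts
// Plain-RegExp illustration of a named backreference (not the builder API).
// `(?<quote>['"])` captures the opening quote; `\k<quote>` requires the same
// character to close the quoted text.
const quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/;

const quotedMatch = quoted.exec(`say "it's fine" now`);
console.log(quotedMatch?.groups?.body); // "it's fine"

// Mismatched quotes do not satisfy the backreference.
console.log(quoted.test(`"unterminated'`)); // false
```
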
### `lookahead()` | ||
|
||
```ts | ||
|
@@ -0,0 +1,18 @@ | ||
import { buildRegExp, digit, endOfString, namedCapture, repeat, startOfString } from '..'; | ||
|
||
// Example: dateRegex | ||
const dateRegex = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/i; | ||
Review comment: Nice one, I'll add this example.
||
const yearRegex = namedCapture(repeat(digit, 4), 'year'); | ||
const monthRegex = namedCapture(repeat(digit, 2), 'month'); | ||
const dayRegex = namedCapture(repeat(digit, 2), 'day'); | ||
const regex = buildRegExp([startOfString, yearRegex, '-', monthRegex, '-', dayRegex, endOfString], { | ||
ignoreCase: true, | ||
}); | ||
|
||
test('dateRegex', () => { | ||
expect(dateRegex).toEqual(regex); | ||
}); | ||
|
||
test('dateRegex matching', () => { | ||
expect(dateRegex).toMatchGroups('2021-08-24', ['2021-08-24', '2021', '08', '24']); | ||
}); |
@@ -0,0 +1,90 @@ | ||
import { | ||
anyOf, | ||
buildRegExp, | ||
charClass, | ||
charRange, | ||
digit, | ||
endOfString, | ||
namedCapture, | ||
oneOrMore, | ||
repeat, | ||
startOfString, | ||
} from '..'; | ||
|
||
// | ||
// Example: email validation building blocks | ||
// | ||
const upperCase = charRange('A', 'Z'); | ||
const lowerCase = charRange('a', 'z'); | ||
const specialChars = anyOf("!#$%&'*+/=?^_`{|}~-"); | ||
const usernameChars = charClass(upperCase, lowerCase, digit, specialChars); | ||
const hostnameChars = charClass(upperCase, lowerCase, digit, specialChars); | ||
const domainChars = charRange('a', 'z'); | ||
const emailSeparator = anyOf('.'); | ||
const domainSeparator = anyOf('@'); | ||
|
||
// | ||
// Example: email validation major components using named capture. | ||
// | ||
const username = namedCapture(oneOrMore(usernameChars), 'username'); | ||
|
||
const usernameRegex = buildRegExp([startOfString, username, endOfString]); | ||
|
||
test('Matching the Username component.', () => { | ||
Review comment: I'll merge this with the existing email example, as they're quite similar. Note: I am planning to add some frequently used patterns (URL, email, maybe hashtags, etc.), so that each user does not have to define them by hand. I will soon spec out this feature. I invite you to join in if you have the capacity to work on that.
||
expect(usernameRegex).toMatchString('john1234'); | ||
expect(usernameRegex).toMatchString('ringo$1234'); | ||
expect(usernameRegex).not.toMatchString('john@1234'); | ||
expect(usernameRegex).not.toMatchString('george.harrison'); | ||
expect(usernameRegex).not.toMatchString('paul.mccartney&wings'); | ||
expect(usernameRegex).not.toMatchString('ringo starr'); | ||
}); | ||
|
||
const hostname = namedCapture(oneOrMore(hostnameChars), 'hostname'); | ||
|
||
const hostnameRegex = buildRegExp([startOfString, hostname, endOfString]); | ||
|
||
test('Matching the Hostname component.', () => { | ||
expect(hostnameRegex).toMatchString('gmail'); | ||
expect(hostnameRegex).toMatchString('google'); | ||
expect(hostnameRegex).toMatchString('g-mail'); | ||
expect(hostnameRegex).toMatchString('g_mail'); | ||
expect(hostnameRegex).not.toMatchString('g mail'); | ||
expect(hostnameRegex).not.toMatchString('g.mail'); | ||
}); | ||
|
||
const domain = namedCapture(repeat(domainChars, { min: 2 }), 'domain'); | ||
|
||
const domainRegex = buildRegExp([startOfString, domain, endOfString]); | ||
|
||
test('Matching the Domain component.', () => { | ||
expect(domainRegex).toMatchString('com'); | ||
expect(domainRegex).toMatchString('org'); | ||
expect(domainRegex).not.toMatchString('c'); | ||
expect(domainRegex).not.toMatchString('o'); | ||
expect(domainRegex).toMatchString('co'); | ||
}); | ||
|
||
test('example: email validation', () => { | ||
const regex = buildRegExp( | ||
[startOfString, username, domainSeparator, hostname, emailSeparator, domain, endOfString], | ||
{ ignoreCase: true }, | ||
); | ||
|
||
expect(regex).toMatchString('[email protected]'); | ||
expect(regex).toMatchString('[email protected]'); | ||
expect(regex).toMatchString('[email protected]'); | ||
expect(regex).not.toMatchString('[email protected]'); | ||
|
||
expect(regex).not.toMatchString('@'); | ||
expect(regex).not.toMatchString('aaa@'); | ||
expect(regex).not.toMatchString('[email protected]'); | ||
expect(regex).not.toMatchString('@gmail.com'); | ||
|
||
const emailAddress = 'abba@gold.com'; | ||
const match = regex.exec(emailAddress); | ||
expect(match).not.toBeNull(); | ||
expect(match?.groups).not.toBeNull(); | ||
expect(match?.groups?.username).toBe('abba'); | ||
expect(match?.groups?.hostname).toBe('gold'); | ||
expect(match?.groups?.domain).toBe('com'); | ||
}); |
@@ -0,0 +1,75 @@ | ||
import { backreference, buildRegExp, capture } from '../..'; | ||
|
||
describe('backreference function', () => { | ||
it('should create a backreference to a previously captured group', () => { | ||
const group = capture('a'); | ||
const backRef = backreference(1); | ||
const groupRegex = buildRegExp([group, backRef]); | ||
|
||
const match = groupRegex.exec('aa'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.[0]).toBe('aa'); | ||
}); | ||
|
||
it('should not match when the backreference does not match the captured group', () => { | ||
const group = capture('a'); | ||
const backRef = backreference(1); | ||
const groupRegex = buildRegExp([group, backRef]); | ||
|
||
const match = groupRegex.exec('a\\1'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should not match when the input does not follow the order of multiple backreferences', () => { | ||
const group1 = capture('a'); | ||
const group2 = capture('b'); | ||
const backRef1 = backreference(1); | ||
const backRef2 = backreference(2); | ||
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]); | ||
const match = groupRegex.exec('aabb'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should handle multiple valid backreferences', () => { | ||
const group1 = capture('a'); | ||
const group2 = capture('b'); | ||
const backRef1 = backreference(1); | ||
const backRef2 = backreference(2); | ||
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]); | ||
|
||
const match = groupRegex.exec('abab'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.[0]).toBe('abab'); | ||
}); | ||
}); | ||
|
||
it('should handle backreferences in different order', () => { | ||
const group1 = capture('a'); | ||
const group2 = capture('b'); | ||
const backRef1 = backreference(2); | ||
const backRef2 = backreference(1); | ||
const groupRegex = buildRegExp([group1, group2, backRef1, backRef2]); | ||
|
||
const match = groupRegex.exec('abba'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.[0]).toBe('abba'); | ||
}); | ||
|
||
it('should not match when the captured text is not repeated in the input', () => { | ||
const group = capture('a'); | ||
const backRef = backreference(1); | ||
const groupRegex = buildRegExp([group, backRef]); | ||
|
||
const match = groupRegex.exec('abba'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should handle multiple backreferences to the same group', () => { | ||
const group1 = capture('a'); | ||
const backRef1 = backreference(1); | ||
const backRef2 = backreference(1); | ||
const groupRegex = buildRegExp([group1, backRef1, backRef2]); | ||
|
||
const match = groupRegex.exec('aaa'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.[0]).toBe('aaa'); | ||
}); |
@@ -0,0 +1,43 @@ | ||
import { buildRegExp, namedBackreference, namedCapture } from '../..'; | ||
|
||
describe('named-backreference function', () => { | ||
it('should create a backreference to a previously captured group', () => { | ||
const group = namedCapture('a', 'groupA'); | ||
const groupRef = namedBackreference('groupA'); | ||
const groupRegex = buildRegExp([group, groupRef]); | ||
|
||
const match = groupRegex.exec('aa'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.[0]).toBe('aa'); | ||
}); | ||
|
||
it('should not match when the backreference does not match the captured group', () => { | ||
const group = namedCapture('a', 'groupA'); | ||
const groupRef = namedBackreference('groupA'); | ||
const groupRegex = buildRegExp([group, groupRef]); | ||
|
||
const match = groupRegex.exec('a\\1'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should not match when the input does not follow the order of multiple named backreferences', () => { | ||
const group1 = namedCapture('a', 'groupA'); | ||
const group2 = namedCapture('b', 'groupB'); | ||
const groupARef = namedBackreference('groupA'); | ||
const groupBRef = namedBackreference('groupB'); | ||
const groupRegex = buildRegExp([group1, group2, groupARef, groupBRef]); | ||
|
||
const match = groupRegex.exec('aabb'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should handle multiple valid backreferences', () => { | ||
const group1 = namedCapture('ab', 'groupA'); | ||
const group2 = namedCapture('ba', 'groupB'); | ||
const groupARef = namedBackreference('groupA'); | ||
const groupBRef = namedBackreference('groupB'); | ||
const groupRegex = buildRegExp([group1, group2, groupARef, groupBRef]); | ||
const match = groupRegex.exec('abbaabba'); | ||
expect(match).not.toBeNull(); | ||
}); | ||
}); |
@@ -0,0 +1,49 @@ | ||
import { buildRegExp, namedCapture, oneOrMore } from '../..'; | ||
|
||
describe('namedCapture function', () => { | ||
it('should create a named capture group', () => { | ||
const regex = buildRegExp(namedCapture('a', 'group1')); | ||
const match = regex.exec('a'); | ||
expect(match?.groups?.group1).toBe('a'); | ||
}); | ||
|
||
it('should not match when the named capture group does not match the input', () => { | ||
const regex = buildRegExp(namedCapture('a', 'group1')); | ||
const match = regex.exec('b'); | ||
expect(match).toBeNull(); | ||
}); | ||
|
||
it('should handle multiple named capture groups', () => { | ||
const regex = buildRegExp([namedCapture('a', 'group1'), namedCapture('b', 'group2')]); | ||
const match = regex.exec('ab'); | ||
expect(match).not.toBeNull(); | ||
expect(match?.groups?.group1).toBe('a'); | ||
expect(match?.groups?.group2).toBe('b'); | ||
}); | ||
|
||
it('should handle nested named capture groups', () => { | ||
const regex = buildRegExp(namedCapture(['a', namedCapture('b', 'group2')], 'group1')); | ||
const match = regex.exec('ab'); | ||
expect(match?.groups?.group1).toBe('ab'); | ||
expect(match?.groups?.group2).toBe('b'); | ||
}); | ||
}); | ||
|
||
describe('namedCapture RegEx matching', () => { | ||
test('`named-capture` pattern', () => { | ||
expect(namedCapture('a', 'abba')).toEqualRegex(/(?<abba>a)/); | ||
Review comment: When seeing [...]
||
expect(namedCapture('abc', 'abc')).toEqualRegex(/(?<abc>abc)/); | ||
expect(namedCapture(oneOrMore('abc'), 'ababab')).toEqualRegex(/(?<ababab>(?:abc)+)/); | ||
expect(oneOrMore(namedCapture('abc', 'abacab'))).toEqualRegex(/(?<abacab>abc)+/); | ||
}); | ||
|
||
test('`named-capture` matching', () => { | ||
expect(namedCapture('b', 'b')).toMatchGroups('ab', ['b', 'b']); | ||
expect(['a', namedCapture('b', 'b')]).toMatchGroups('ab', ['ab', 'b']); | ||
expect(['a', namedCapture('b', 'b'), namedCapture('c', 'c')]).toMatchGroups('abc', [ | ||
'abc', | ||
'b', | ||
'c', | ||
]); | ||
}); | ||
}); |
@@ -0,0 +1,22 @@ | ||
import type { EncodeResult } from '../encoder/types'; | ||
import type { GroupNumber, RegexConstruct } from '../types'; | ||
|
||
export interface Backreference extends RegexConstruct { | ||
type: 'backreference'; | ||
group: GroupNumber; | ||
} | ||
|
||
export function backreference(groupNumber: GroupNumber): Backreference { | ||
return { | ||
type: 'backreference', | ||
group: groupNumber, | ||
encode: encodeBackreference, | ||
}; | ||
} | ||
|
||
// Encodes the construct as a numbered backreference, e.g. `\1`. | ||
function encodeBackreference(this: Backreference): EncodeResult { | ||
return { | ||
precedence: 'atom', | ||
pattern: `\\${this.group}`, | ||
}; | ||
} |
Review comment: Using ordinal backreferences accurately might be problematic in more complex expressions (nesting, etc.). Therefore, I think we can drop them without losing any functionality for users trying to build maintainable regexes.
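
The concern about ordinal backreferences can be illustrated with the built-in `RegExp` (not the builder API). Group numbers follow the order of opening parentheses, so wrapping an earlier part of the expression in another group silently changes what `\2` points at. The patterns and names below are illustrative assumptions.

```ts
// Plain-RegExp illustration (not the builder API) of ordinal numbering shifting.
const flat = /(a)(b)\2/;         // \2 refers to (b)
const nested = /((a)x?)(b)\2/;   // after wrapping (a), \2 refers to the inner (a)

console.log(flat.test('abb'));   // true
console.log(nested.test('abb')); // false: the pattern now expects "aba"
console.log(nested.test('aba')); // true

// A named reference keeps pointing at the same group after the same change.
const named = /((a)x?)(?<second>b)\k<second>/;
console.log(named.test('abb'));  // true
```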