ICU-22789 Add Segmenter API to conveniently wrap BreakIterator #3237

Draft: wants to merge 44 commits into base: main

Commits (44):
d3de927 ICU-22789 Add initial code for Segmenter interface, concrete impl, Se… (echeran, Oct 8, 2024)
6ce5fdc ICU-22789 Finish impl for LocalizedSegmenter (echeran, Oct 8, 2024)
858d790 ICU-22789 Add an initial impl for RuleBasedSegmenter (echeran, Oct 8, 2024)
1313ff0 ICU-22789 Fix typos (echeran, Oct 8, 2024)
aa4724a ICU-22789 Refactor duplicate impl of ranges into default interface me… (echeran, Oct 9, 2024)
5defe25 ICU-22789 Refactor `Segments` to only create BreakIterator once per s… (echeran, Oct 9, 2024)
4d2f1d0 ICU-22789 Make the Segments.ranges() Stream lazy (echeran, Oct 29, 2024)
cf44d9f Move the SegmentationType into Segmenter b/c it is not specific to Lo… (echeran, Nov 5, 2024)
325d72a formatting typo (echeran, Nov 7, 2024)
f9a8dc2 Rename enum value to match Unicode terminology for clarity purposes (echeran, Nov 15, 2024)
b88e32d Shorten names of inner classes to remove redundant prefix (echeran, Nov 19, 2024)
652d49a ICU-22789 Create subpackage for Segmenter related classes (echeran, Dec 4, 2024)
032cf04 ICU-22789 Parameterize iteration by direction of iteration (echeran, Dec 5, 2024)
faafbcb ICU-22789 Create a public `Function` to convert a range to a string (echeran, Dec 5, 2024)
3f62900 ICU-22789 Add API for backwards direction lazy Stream (echeran, Dec 5, 2024)
7c0c366 ICU-22789 Match BreakIterator behavior to always advance from start pos (echeran, Dec 5, 2024)
1f4b19a ICU-22789 Add `rangeAfterIndex` and `rangeBeforeIndex` (echeran, Dec 10, 2024)
0358238 ICU-22789 Genericize source string type as CharSequence instead of St… (echeran, Dec 12, 2024)
a57730f ICU-22789 Promote `int` to `long` for ICU `assertEquals` test compare… (echeran, Dec 12, 2024)
e20fa35 ICU-22789 Fix bug for next range when only `limit`==`DONE` but not `s… (echeran, Dec 12, 2024)
03f1e07 ICU-22789 Add APIs for IntStream of boundary indices (echeran, Dec 12, 2024)
7f5dbd0 ICU-22789 Add logKnownIssue in test for other API where it pertains (echeran, Dec 12, 2024)
55226ac ICU-22789 Minor formatting typo (echeran, Dec 12, 2024)
60ea6f3 ICU-22789 Refactor `Range` into `Segment` (echeran, Dec 13, 2024)
94ec357 ICU-22789 Make followup adjustments to Segment field accessors after … (echeran, Dec 31, 2024)
5b6eadd ICU-22789 Refactor default impls of `Segments` interface into reusabl… (echeran, Dec 31, 2024)
8ee08e3 ICU-22789 Add source CharSequence to Segment class (echeran, Jan 2, 2025)
767789a ICU-22789 Remove unused getters (echeran, Jan 2, 2025)
d9445d2 ICU-22789 Move SegmentationType enum back into LocalizedSegmenter (echeran, Jan 2, 2025)
9efe2df ICU-22789 Mark getNewBreakIterator internal until we can remove it (echeran, Jan 2, 2025)
e0f3554 ICU-22789 Use interface type in declarations used in tests (echeran, Jan 2, 2025)
f12d724 ICU-22789 Create top level classes for builders of concrete Segmenter… (echeran, Jan 2, 2025)
0e2b1db ICU-22789 Add isBoundary API for Segments interface (echeran, Jan 2, 2025)
9fbcc8a ICU-22789 Rename boundariesAfter API for Segments interface (echeran, Jan 3, 2025)
4163a6d ICU-22789 Rename and adjust boundary logic for boundariesBackFrom API… (echeran, Jan 3, 2025)
d9017e0 ICU-22789 Fix typos, add TODOs for future optimization design (echeran, Jan 3, 2025)
47ffdd8 ICU-22789 Add segmentAt API for Segments interface (echeran, Jan 3, 2025)
e500e42 ICU-22789 Rename and adjust logic for Stream<Segment>-returning APIs (echeran, Jan 3, 2025)
f9f4d04 ICU-22789 Fix boundary condition behavior for segmentsBefore API (echeran, Jan 4, 2025)
cf979a4 ICU-22789 Fix naming of APIs further (echeran, Jan 4, 2025)
1936402 Merge branch 'main' into breakiter-api-modern (echeran, Jan 8, 2025)
f590068 ICU-22789 Fix localized segmenter test by not expecting locale tailor… (echeran, Jan 9, 2025)
036eb6e Revert "ICU-22789 Create top level classes for builders of concrete S… (echeran, Jan 9, 2025)
124c487 ICU-22789 Remove `segmentAfterIndex` and `segmentBeforeIndex` (echeran, Jan 9, 2025)
@@ -0,0 +1,151 @@
package com.ibm.icu.text.segmenter;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;
import java.util.function.Function;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class LocalizedSegmenter implements Segmenter {

    private ULocale locale;

    private SegmentationType segmentationType;

    @Override
    public Segments segment(CharSequence s) {
        return new LocalizedSegments(s, this);
    }

    public static Builder builder() {
        return new Builder();
    }

    LocalizedSegmenter(ULocale locale, SegmentationType segmentationType) {
        this.locale = locale;
        this.segmentationType = segmentationType;
    }

    /**
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Override
    @Deprecated
    public BreakIterator getNewBreakIterator() {
        BreakIterator breakIter;
        switch (this.segmentationType) {
            case LINE:
                breakIter = BreakIterator.getLineInstance(this.locale);
                break;
            case SENTENCE:
                breakIter = BreakIterator.getSentenceInstance(this.locale);
                break;
            case WORD:
                breakIter = BreakIterator.getWordInstance(this.locale);
                break;
            case GRAPHEME_CLUSTER:
            default:
                breakIter = BreakIterator.getCharacterInstance(this.locale);
                break;
        }
        return breakIter;
    }

    public enum SegmentationType {
        GRAPHEME_CLUSTER,
        WORD,
        LINE,
        SENTENCE,
    }

    public static class Builder {

        private ULocale locale = ULocale.ROOT;

        private SegmentationType segmentationType = SegmentationType.GRAPHEME_CLUSTER;

        Builder() { }

        public Builder setLocale(ULocale locale) {
            this.locale = locale;
            return this;
        }

        public Builder setSegmentationType(SegmentationType segmentationType) {
            this.segmentationType = segmentationType;
            return this;
        }

        public LocalizedSegmenter build() {
            return new LocalizedSegmenter(this.locale, this.segmentationType);
        }

    }

    public class LocalizedSegments implements Segments {

        private CharSequence source;

        private LocalizedSegmenter segmenter;

        private BreakIterator breakIter;

        private LocalizedSegments(CharSequence source, LocalizedSegmenter segmenter) {
            this.source = source;
            this.segmenter = segmenter;
            this.breakIter = this.segmenter.getNewBreakIterator();
        }

        @Override
        public Stream<CharSequence> subSequences() {
            return SegmentsImplUtils.subSequences(this.breakIter, this.source);
        }

        @Override
        public Segment segmentAt(int i) {
            return SegmentsImplUtils.segmentAt(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segments() {
            return SegmentsImplUtils.segments(this.breakIter, this.source);
        }

        @Override
        public boolean isBoundary(int i) {
            return SegmentsImplUtils.isBoundary(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segmentsFrom(int i) {
            return SegmentsImplUtils.segmentsFrom(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segmentsBefore(int i) {
            return SegmentsImplUtils.segmentsBefore(this.breakIter, this.source, i);
        }

        @Override
        public Function<Segment, CharSequence> segmentToSequenceFn() {
            return SegmentsImplUtils.segmentToSequenceFn(this.source);
        }

        @Override
        public IntStream boundaries() {
            return SegmentsImplUtils.boundaries(this.breakIter, this.source);
        }

        @Override
        public IntStream boundariesAfter(int i) {
            return SegmentsImplUtils.boundariesAfter(this.breakIter, this.source, i);
        }

        @Override
        public IntStream boundariesBackFrom(int i) {
            return SegmentsImplUtils.boundariesBackFrom(this.breakIter, this.source, i);
        }
    }

}
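
The `switch` in `getNewBreakIterator` maps each `SegmentationType` to a `BreakIterator` factory method, and `subSequences()` hands the resulting iterator to `SegmentsImplUtils`, which this diff does not show. Below is a minimal sketch of that flow. It assumes the JDK's `java.text.BreakIterator` in place of ICU's `com.ibm.icu.text.BreakIterator` so that it runs without ICU4J on the classpath. The names `LocalizedSegmenterSketch`, `Type`, and the eager, `List`-returning `subSequences` are hypothetical and are not the PR's API, and the JDK's segmentation rules may differ from ICU's.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for LocalizedSegmenter, built on the JDK BreakIterator
// (an assumption made so the sketch runs without ICU4J).
final class LocalizedSegmenterSketch {

    enum Type { GRAPHEME_CLUSTER, WORD, LINE, SENTENCE }

    // Mirrors the shape of the switch in getNewBreakIterator():
    // one factory call per segmentation type, grapheme clusters as the default.
    static BreakIterator newBreakIterator(Type type, Locale locale) {
        switch (type) {
            case LINE:
                return BreakIterator.getLineInstance(locale);
            case SENTENCE:
                return BreakIterator.getSentenceInstance(locale);
            case WORD:
                return BreakIterator.getWordInstance(locale);
            case GRAPHEME_CLUSTER:
            default:
                return BreakIterator.getCharacterInstance(locale);
        }
    }

    // Eagerly collects the text between each pair of adjacent boundaries.
    // The PR returns a Stream<CharSequence>; a List keeps this sketch short.
    static List<String> subSequences(String source, Type type, Locale locale) {
        BreakIterator iter = newBreakIterator(type, locale);
        iter.setText(source);
        List<String> result = new ArrayList<>();
        int start = iter.first();
        for (int end = iter.next(); end != BreakIterator.DONE; start = end, end = iter.next()) {
            result.add(source.substring(start, end));
        }
        return result;
    }
}
```

Because every call to `subSequences` creates its own iterator, calls do not share iteration state. Concatenating the returned pieces in order reproduces the source string.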
@@ -0,0 +1,115 @@
package com.ibm.icu.text.segmenter;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;
import java.util.function.Function;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class RuleBasedSegmenter implements Segmenter {

    private String rules;

    @Override
    public Segments segment(CharSequence s) {
        return new RuleBasedSegments(s, this);
    }

    public static Builder builder() {
        return new Builder();
    }

    RuleBasedSegmenter(String rules) {
        this.rules = rules;
    }

    /**
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Override
    @Deprecated
    public RuleBasedBreakIterator getNewBreakIterator() {
        return new RuleBasedBreakIterator(this.rules);
    }

    public static class Builder {

        String rules;

        Builder() { }

        public Builder setRules(String rules) {
            this.rules = rules;
            return this;
        }

        public RuleBasedSegmenter build() {
            return new RuleBasedSegmenter(this.rules);
        }
    }

    public static class RuleBasedSegments implements Segments {
        private CharSequence source;

        private RuleBasedSegmenter segmenter;

        private BreakIterator breakIter;

        RuleBasedSegments(CharSequence source, RuleBasedSegmenter segmenter) {
            this.source = source;
            this.segmenter = segmenter;
            this.breakIter = this.segmenter.getNewBreakIterator();
        }

        @Override
        public Stream<CharSequence> subSequences() {
            return SegmentsImplUtils.subSequences(this.breakIter, this.source);
        }

        @Override
        public Segment segmentAt(int i) {
            return SegmentsImplUtils.segmentAt(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segments() {
            return SegmentsImplUtils.segments(this.breakIter, this.source);
        }

        @Override
        public boolean isBoundary(int i) {
            return SegmentsImplUtils.isBoundary(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segmentsFrom(int i) {
            return SegmentsImplUtils.segmentsFrom(this.breakIter, this.source, i);
        }

        @Override
        public Stream<Segment> segmentsBefore(int i) {
            return SegmentsImplUtils.segmentsBefore(this.breakIter, this.source, i);
        }

        @Override
        public Function<Segment, CharSequence> segmentToSequenceFn() {
            return SegmentsImplUtils.segmentToSequenceFn(this.source);
        }

        @Override
        public IntStream boundaries() {
            return SegmentsImplUtils.boundaries(this.breakIter, this.source);
        }

        @Override
        public IntStream boundariesAfter(int i) {
            return SegmentsImplUtils.boundariesAfter(this.breakIter, this.source, i);
        }

        @Override
        public IntStream boundariesBackFrom(int i) {
            return SegmentsImplUtils.boundariesBackFrom(this.breakIter, this.source, i);
        }
    }
}
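
Both `Segments` implementations delegate `boundaries()` to `SegmentsImplUtils.boundaries(breakIter, source)`, and the commit list mentions making the streams lazy. How `SegmentsImplUtils` does this is not visible in this diff. The sketch below shows one possible way to expose break positions as a lazy `IntStream`, assuming the JDK's `java.text.BreakIterator` rather than ICU's so that it is runnable on its own. The class name `LazyBoundariesSketch` is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

// Hypothetical helper; an assumption about one way to build a lazy boundary
// stream, not the PR's SegmentsImplUtils.
final class LazyBoundariesSketch {

    static IntStream boundaries(BreakIterator iter, String source) {
        iter.setText(source);
        Spliterator.OfInt split = new Spliterators.AbstractIntSpliterator(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {

            private boolean started = false;

            // Pulls one boundary from the iterator each time the stream asks for
            // an element, so nothing is computed ahead of demand.
            @Override
            public boolean tryAdvance(IntConsumer action) {
                int boundary = started ? iter.next() : iter.first();
                started = true;
                if (boundary == BreakIterator.DONE) {
                    return false;
                }
                action.accept(boundary);
                return true;
            }
        };
        // Sequential stream: BreakIterator is stateful and not safe to share.
        return StreamSupport.intStream(split, false);
    }
}
```

A short-circuiting operation such as `limit(2)` stops advancing the iterator after two boundaries. Since the stream reads from a single stateful iterator, two streams built on the same iterator should not be consumed in an interleaved fashion.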
@@ -0,0 +1,15 @@
package com.ibm.icu.text.segmenter;

import com.ibm.icu.text.BreakIterator;

public interface Segmenter {
    Segments segment(CharSequence s);

    /**
     * @internal
     * @deprecated This API is ICU internal only.
     */
    @Deprecated
    BreakIterator getNewBreakIterator();

}
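
The `Segments` object returned by `Segmenter.segment` also offers point queries such as `isBoundary(int)` and `segmentAt(int)`, whose implementations live in `SegmentsImplUtils` and are not part of this diff. The following sketch illustrates what such queries could look like on top of a break iterator. It assumes the JDK's `java.text.BreakIterator` instead of ICU's, and the names `SegmentAtSketch` and `Span` are hypothetical (the PR's `Segment` class is not shown here). The exact edge-case behavior of the PR's methods may differ.

```java
import java.text.BreakIterator;

// Hypothetical illustration of point queries over a break iterator;
// not the PR's SegmentsImplUtils.
final class SegmentAtSketch {

    // Hypothetical value type holding a half-open range [start, limit).
    static final class Span {
        final int start;
        final int limit;

        Span(int start, int limit) {
            this.start = start;
            this.limit = limit;
        }
    }

    // True if index i falls on a boundary of the segmented text.
    static boolean isBoundary(BreakIterator iter, String source, int i) {
        iter.setText(source);
        return iter.isBoundary(i);
    }

    // Returns the segment containing the code unit at index i.
    static Span segmentAt(BreakIterator iter, String source, int i) {
        if (i < 0 || i >= source.length()) {
            throw new IndexOutOfBoundsException("index " + i + " is outside the source text");
        }
        iter.setText(source);
        int limit = iter.following(i); // first boundary strictly after i
        int start = iter.previous();   // boundary immediately before that one
        return new Span(start, limit);
    }
}
```

With a word iterator, for example, `segmentAt` on an index inside a word returns the range covering that whole word, and `source.substring(span.start, span.limit)` recovers its text.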