
Conversation

@junegunn
Member

@junegunn junegunn commented Nov 3, 2025

@junegunn junegunn self-assigned this Nov 3, 2025

Copilot AI left a comment


Pull Request Overview

This PR adds support for a limit field to the Scan protobuf message, allowing scans to specify a maximum number of rows to return. The implementation integrates this limit with the existing configuration-based limit in table snapshot input format.

Key Changes

  • Added optional uint32 limit field to the Scan protobuf message
  • Updated serialization/deserialization logic in ProtobufUtil to handle the limit field
  • Modified TableSnapshotInputFormatImpl to merge scan limit with configuration limit, using the minimum of both when both are specified
  • Added comprehensive test coverage including serialization tests and MapReduce integration tests

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File Description
hbase-protocol-shaded/src/main/protobuf/client/Client.proto Added optional uint32 limit field to Scan message
hbase-client/src/main/java/org/apache/hadoop/hbase/shaded/protobuf/ProtobufUtil.java Added serialization/deserialization logic for scan limit field
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java Implemented logic to merge configuration and scan-based limits
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableSnapshotInputFormat.java Refactored and added test cases for limit behavior
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableMapReduceUtil.java Added comprehensive serialization test with limit field
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormat.java Added test for limit=1 behavior with serialization


Comment on lines +260 to +263
int confLimit = conf.getInt(SNAPSHOT_INPUTFORMAT_ROW_LIMIT_PER_INPUTSPLIT, 0);
int scanLimit = Math.max(scan.getLimit(), 0);
this.rowLimitPerSplit =
confLimit == 0 ? scanLimit : scanLimit == 0 ? confLimit : Math.min(confLimit, scanLimit);

Copilot AI Nov 3, 2025


[nitpick] The logic for merging confLimit and scanLimit uses nested ternary operators which are difficult to read and maintain. Consider refactoring to use if-else statements or extracting this logic into a separate helper method with a descriptive name like resolveLimitPerSplit(int confLimit, int scanLimit).
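For reference, a minimal sketch of the helper this nitpick has in mind, assuming the same semantics as the ternary above (a value of 0 meaning "no limit"); it only illustrates the suggestion and is not code from the patch:

  // Sketch only: equivalent to the nested ternary in the snippet above.
  // A value of 0 means "no limit" for both sources.
  private static int resolveLimitPerSplit(int confLimit, int scanLimit) {
    if (confLimit == 0) {
      return scanLimit;
    }
    if (scanLimit == 0) {
      return confLimit;
    }
    return Math.min(confLimit, scanLimit);
  }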

Member Author


A fair suggestion, but I feel the current code isn't complicated enough to warrant refactoring.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 2m 11s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 buf 0m 0s buf was not available.
+0 🆗 buf 0m 0s buf was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 17s Maven dependency ordering for branch
+1 💚 mvninstall 5m 12s master passed
+1 💚 compile 2m 55s master passed
+1 💚 checkstyle 0m 57s master passed
+1 💚 spotbugs 4m 51s master passed
+1 💚 spotless 1m 11s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 13s Maven dependency ordering for patch
+1 💚 mvninstall 4m 50s the patch passed
+1 💚 compile 2m 47s the patch passed
+1 💚 cc 2m 47s the patch passed
+1 💚 javac 2m 47s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 53s the patch passed
+1 💚 spotbugs 5m 19s the patch passed
+1 💚 hadoopcheck 14m 49s Patch does not cause any errors with Hadoop 3.3.6 3.4.1.
+1 💚 hbaseprotoc 1m 37s the patch passed
+1 💚 spotless 1m 4s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 33s The patch does not generate ASF License warnings.
59m 16s
Subsystem Report/Notes
Docker ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7432/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #7432
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless cc buflint bufcompat hbaseprotoc
uname Linux 019599eec802 6.8.0-1024-aws #26~22.04.1-Ubuntu SMP Wed Feb 19 06:54:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 42a8c05
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 71 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-mapreduce U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7432/1/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 2m 13s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 18s Maven dependency ordering for branch
+1 💚 mvninstall 5m 42s master passed
+1 💚 compile 2m 17s master passed
+1 💚 javadoc 1m 13s master passed
+1 💚 shadedjars 9m 12s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 16s Maven dependency ordering for patch
+1 💚 mvninstall 4m 54s the patch passed
+1 💚 compile 1m 49s the patch passed
+1 💚 javac 1m 49s the patch passed
+1 💚 javadoc 0m 57s the patch passed
+1 💚 shadedjars 8m 37s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 0m 54s hbase-protocol-shaded in the patch passed.
+1 💚 unit 3m 0s hbase-client in the patch passed.
+1 💚 unit 41m 35s hbase-mapreduce in the patch passed.
85m 8s
Subsystem Report/Notes
Docker ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7432/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #7432
Optional Tests javac javadoc unit compile shadedjars
uname Linux 1734650b9b51 6.8.0-1024-aws #26~22.04.1-Ubuntu SMP Wed Feb 19 06:54:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 42a8c05
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7432/1/testReport/
Max. process+thread count 1578 (vs. ulimit of 30000)
modules C: hbase-protocol-shaded hbase-client hbase-mapreduce U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7432/1/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

scanBuilder.setNeedCursorResult(true);
}
scanBuilder.setQueryMetricsEnabled(scan.isQueryMetricsEnabled());
if (scan.getLimit() > 0) {
Contributor


We do not set the limit here? Strange; if that is the case, shouldn't lots of our limit-related UTs fail?

Member Author

@junegunn junegunn Nov 3, 2025


The absence of it only matters when Protobuf serialization is involved. For example, in the existing test cases in TestTableInputFormat, 1. no serialization was performed, and 2. no Scan#setLimit call was made, so the issue never surfaced. So I guess this particular case (Protobuf serialization of a Scan object with a non-default limit) wasn't previously covered by the unit tests.
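To make the failure mode concrete, here is a minimal round-trip sketch, assuming the shaded ProtobufUtil overloads that convert between the client Scan and ClientProtos.Scan; it is an illustration, not a test from the patch:

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil;
  import org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos;

  // Round-trip a Scan through its protobuf form, which is essentially what
  // happens when the scan is written into the job configuration and read back.
  static int limitAfterRoundTrip() throws Exception {
    Scan scan = new Scan().setLimit(5);
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan);  // serialize
    Scan restored = ProtobufUtil.toScan(proto);           // deserialize
    // Before this patch: the limit is not part of the Scan message, so this
    // returns the default (-1) rather than 5. After this patch: returns 5.
    return restored.getLimit();
  }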

Contributor


OK, the limit lives in ScanRequest, not in Scan; a Scan is just part of a ScanRequest.

So we should not change the code here.

Comment on lines +355 to +373
  @Test
  public void testScanLimitOnConfig() throws Exception {
    testScanLimit(10, new Scan(), 10);
  }

  @Test
  public void testScanLimitOnScan() throws Exception {
    testScanLimit(0, new Scan().setLimit(10), 10);
  }

  @Test
  public void testScanLimitOnBothButScanWins() throws Exception {
    testScanLimit(10, new Scan().setLimit(5), 5);
  }

  @Test
  public void testScanLimitOnBothButConfigWins() throws Exception {
    testScanLimit(5, new Scan().setLimit(10), 5);
  }
Member Author


To verify that the smaller value from the two sources is chosen. Maybe a bit overkill for such simple logic, considering each case takes quite a while to run?

Contributor

@Apache9 Apache9 left a comment


I do not fully understand the problem here.

In real usage, how can we set the limit on the Scan object? When using TableInputFormat or TableSnapshotInputFormat, there is no way for us to pass a Scan object in, is there?

@junegunn
Member Author

junegunn commented Nov 3, 2025

When we run an MR job or a Spark job, we define how we want to scan a table or a snapshot via a Scan object, serialize it into a String, and set it as the hbase.mapreduce.scan (TableInputFormat.SCAN) property of a Configuration.

  /**
   * Use this before submitting a TableMap job. It will appropriately set up the job.
   * @param table             The table name to read from.
   * @param scan              The scan instance with the columns, time range etc.
   * @param mapper            The mapper class to use.
   * @param outputKeyClass    The class of the output key.
   * @param outputValueClass  The class of the output value.
   * @param job               The current job to adjust. Make sure the passed job is carrying all
   *                          necessary HBase configuration.
   * @param addDependencyJars upload HBase jars and jars for any of the configured job classes via
   *                          the distributed cache (tmpjars).
   * @param initCredentials   whether to initialize hbase auth credentials for the job
   * @param inputFormatClass  the input format
   * @throws IOException When setting up the details fails.
   */
  public static void initTableMapperJob(String table, Scan scan,
    Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass,
    Job job, boolean addDependencyJars, boolean initCredentials,
    Class<? extends InputFormat> inputFormatClass) throws IOException {
    job.setInputFormatClass(inputFormatClass);
    if (outputValueClass != null) job.setMapOutputValueClass(outputValueClass);
    if (outputKeyClass != null) job.setMapOutputKeyClass(outputKeyClass);
    job.setMapperClass(mapper);
    if (Put.class.equals(outputValueClass)) {
      job.setCombinerClass(PutCombiner.class);
    }
    Configuration conf = job.getConfiguration();
    HBaseConfiguration.merge(conf, HBaseConfiguration.create(conf));
    conf.set(TableInputFormat.INPUT_TABLE, table);
    conf.set(TableInputFormat.SCAN, convertScanToString(scan));
    conf.setStrings("io.serializations", conf.get("io.serializations"),
      MutationSerialization.class.getName(), ResultSerialization.class.getName(),
      CellSerialization.class.getName());
    if (addDependencyJars) {
      addDependencyJars(job);
    }
    if (initCredentials) {
      initCredentials(job);
    }
  }

TableInputFormat and TableSnapshotInputFormat then deserialize this back into a Scan.
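For context, a typical job setup might look like the sketch below ("my_table" and MyTableMapper are placeholders, and the six-argument initTableMapperJob overload is assumed); the Scan, including any limit set on it, ends up serialized into the hbase.mapreduce.scan property:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.mapreduce.Job;

  static Job buildScanJob() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-with-limit");

    // The Scan carries the whole read definition, including the limit this PR
    // is about; setLimit here is what users would expect to be honored.
    Scan scan = new Scan().setLimit(100);

    // MyTableMapper is a placeholder TableMapper subclass. Internally this call
    // serializes the Scan and stores it in the hbase.mapreduce.scan property.
    TableMapReduceUtil.initTableMapperJob("my_table", scan, MyTableMapper.class,
      ImmutableBytesWritable.class, Result.class, job);
    return job;
  }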

@Apache9
Contributor

Apache9 commented Nov 4, 2025

OK, so the problem is that we do not store the limit in the serialized Scan object, so when deserializing we cannot get the limit back.

But the design of the client API is that we do not pass the actual limit through the Scan object, since it is the global limit, not per region; we need to calculate the remaining limit for each region and pass it to the region server.

Maybe a possible way is to add a new config field where we serialize the Scan object as JSON? That way we can add all the fields we want.

Thanks.

@junegunn
Member Author

junegunn commented Nov 4, 2025

Thanks for the comment.

> OK, so the problem is that we do not store the limit in the serialized Scan object, so when deserializing we cannot get the limit back.

Correct. When you pass a Scan with a custom limit to an MR job, you would expect each mapper to return at most that number of rows, but instead you end up getting all records in the table.

> since it is the global limit, not per region

I assumed that users advanced enough to run MR/Spark jobs with HBase would already understand that, in that context, each partition (region) runs its own Scan in parallel, and that setLimit applies locally to each partition. Analogous to how you can't enforce a global limit using PageFilter, etc. But I could be wrong, maybe I just know too much about the quirks and limitations :)

When this patch is applied, users might be surprised to see that an MR job using setLimit returns "limit × regions" number of rows, but I think that's still better than having setLimit silently ignored.

As for storing the limit value in the serialized field, I don't see any problem with it. It might look a bit redundant, but it's harmless because it's not used anywhere else (please correct me if I'm wrong) and serves as a more accurate description of the original Scan itself.

@Apache9
Contributor

Apache9 commented Nov 4, 2025

> Thanks for the comment.

> > OK, so the problem is that we do not store the limit in the serialized Scan object, so when deserializing we cannot get the limit back.

> Correct. When you pass a Scan with a custom limit to an MR job, you would expect each mapper to return at most that number of rows, but instead you end up getting all records in the table.

> > since it is the global limit, not per region

> I assumed that users advanced enough to run MR/Spark jobs with HBase would already understand that, in that context, each partition (region) runs its own Scan in parallel, and that setLimit applies locally to each partition. Analogous to how you can't enforce a global limit using PageFilter, etc. But I could be wrong, maybe I just know too much about the quirks and limitations :)

> When this patch is applied, users might be surprised to see that an MR job using setLimit returns "limit × regions" number of rows, but I think that's still better than having setLimit silently ignored.

OK, I think this is the reason why we did not consider the scan limit in the past: under a parallel execution scenario, a global limit does not make sense. The rowsLimitPerSplit configuration is better as it says 'per split', so maybe we should also introduce such a config in TableInputFormat? And when serializing the Scan object, if we find users setting a scan limit, we log a warning message to tell users that this will not work?

> As for storing the limit value in the serialized field, I don't see any problem with it. It might look a bit redundant, but it's harmless because it's not used anywhere else (please correct me if I'm wrong) and serves as a more accurate description of the original Scan itself.

It will increase the message size of the Scan object and does not bring any advantage for Scan, so I would prefer we not add it if possible...

@junegunn
Member Author

junegunn commented Nov 4, 2025

> It will increase the message size of the Scan object and does not bring any advantage for Scan, so I would prefer we not add it if possible...

Fair enough; however small, it's still wasteful. If I'm not mistaken, the serialized size does not increase if we don't set the optional field. So how about overloading ProtobufUtil.toScan with a version that skips the field, and using that in normal scans (in RequestConverter.buildScanRequest)? It would benefit MR users without introducing unnecessary overhead to normal scans. If that sounds reasonable, I can update the patch; otherwise, I'd say we leave this as a known issue and close it as "wontfix".
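One possible shape for that overload, sketched here purely for discussion (the parameter name and the clearLimit-based delegation are illustrative, not the actual patch):

  // Hypothetical two-argument overload: MR callers keep the existing behavior,
  // while RequestConverter.buildScanRequest could pass includeLimit = false so
  // the limit it already sends at the ScanRequest level is not duplicated.
  public static ClientProtos.Scan toScan(final Scan scan, boolean includeLimit)
    throws IOException {
    ClientProtos.Scan pb = toScan(scan); // existing conversion, now writing the limit
    if (!includeLimit && pb.hasLimit()) {
      return pb.toBuilder().clearLimit().build();
    }
    return pb;
  }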

@Apache9
Contributor

Apache9 commented Nov 4, 2025

Oh, there was a formatting problem with my reply, so maybe you missed part of it...

As I explained above, the scan limit is a global limit, and under a parallel scenario a global limit does not make sense...

The rowsLimitPerSplit configuration is better as it says 'per split', so maybe we should also introduce such a config in TableInputFormat? And when serializing the Scan object, if we find users setting a scan limit, we log a warning message to tell users that this will not work?

@junegunn
Member Author

junegunn commented Nov 4, 2025

Thanks, that is one option. But can we update the documentation of Scan#setLimit to say:

> ... When used with TableInputFormat or TableSnapshotInputFormat, this limit is applied locally to each split.

From a user’s perspective, even having a per-split limit is still an improvement over the current behavior.

As for introducing a new dedicated option, I'd personally prefer fixing and reusing the existing interface. As long as we clearly document its behavior, I think it's better than introducing another configuration parameter for users to discover. HBase already has too many little-known parameters buried deep in the source code.

@Apache9
Contributor

Apache9 commented Nov 5, 2025

> Thanks, that is one option. But can we update the documentation of Scan#setLimit to say:

> > ... When used with TableInputFormat or TableSnapshotInputFormat, this limit is applied locally to each split.

> From a user’s perspective, even having a per-split limit is still an improvement over the current behavior.

> As for introducing a new dedicated option, I'd personally prefer fixing and reusing the existing interface. As long as we clearly document its behavior, I think it's better than introducing another configuration parameter for users to discover. HBase already has too many little-known parameters buried deep in the source code.

We'd better follow the same pattern as TableSnapshotInputFormat, where we introduce a configuration to set the limit per split, instead of passing it through Scan. This is a clearer solution. Not all users will look deeply into the javadoc, and if we go with your solution we introduce different meanings when using Scan limit through a normal scan versus a MapReduce job, which may confuse our users...

@junegunn
Member Author

junegunn commented Nov 6, 2025

I understand your point.

> Not all users will look deeply into the javadoc

True, and unfortunately, those users will still complain that Scan#setLimit doesn't work as expected no matter what. So we should:

  • Document the limitation of the method in the javadoc ("this doesn't work with TableInputFormat"),
  • And in case it's overlooked, print a warning message when serializing a Scan with a limit for an MR job.

In order to do that, we need to introduce an internal version of ProtobufUtil.toScan that doesn't print a warning message and use it in RequestConverter.buildScanRequest. However, the public ProtobufUtil.toScan cannot tell whether users are setting the new per-split-limit parameter in their configuration, or whether they are already aware of the limitation; in those cases the warning can feel redundant. Users would then have to manually unset the limit to silence it, which is not ideal, so I'm not entirely sure about adding the warning.
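For illustration, the warning could sit next to the serialization step, along these lines (a sketch only: the log message is made up, and the body only approximates what TableMapReduceUtil.convertScanToString does today):

  import java.io.IOException;
  import java.util.Base64;

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil;
  import org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  static final Logger LOG = LoggerFactory.getLogger("ScanLimitWarning");

  // Hypothetical: warn at job-setup time when a limit is set on the Scan,
  // since the limit would only apply per split, not across the whole job.
  static String convertScanToString(Scan scan) throws IOException {
    if (scan.getLimit() > 0) {
      LOG.warn("Scan#setLimit is applied per input split by TableInputFormat and "
        + "TableSnapshotInputFormat; it does not enforce a job-wide limit.");
    }
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan); // existing serialization step
    return Bytes.toString(Base64.getEncoder().encode(proto.toByteArray()));
  }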

> introduce different meanings when using Scan limit

I thought it was acceptable, because we already have constructs that behave differently in parallel scenarios (e.g. stateful filters like PageFilter and WhileMatchFilter).

> This is because the filter is applied separately on different region servers.
https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/PageFilter.html

So I assumed it was already well-understood that a separate Scan operates per split in such cases. But maybe that's just me.

@Apache9
Contributor

Apache9 commented Nov 6, 2025

But we already have rowsLimitPerSplit for TableSnapshotInputFormat, right? So aligning the logic of TableInputFormat with TableSnapshotInputFormat is natural, and we also do not need to care about the serialization problem. The logic will be complicated if we support both (as in this PR).

@junegunn
Member Author

junegunn commented Nov 6, 2025

I see. Actually, when I opened this PR I was considering deprecating the row.limit.per.inputsplit configuration, because with Scan#setLimit, we have a unified interface for per-split limit, eliminating the need for two separate configuration parameters:

  • hbase.TableSnapshotInputFormat.row.limit.per.inputsplit
  • hbase.TableInputFormat.row.limit.per.inputsplit

However, I respect your decision if you prefer not to promote setLimit for per-split limits. In that case the first thing we should do is to document the limitation in the javadoc.

By the way, the hbase.TableSnapshotInputFormat.row.limit.per.inputsplit parameter is defined in TableSnapshotInputFormatImpl, which is marked @InterfaceAudience.Private.
