Artsiom Yudovin (ayudovin), @Altoros, Minsk. Contributor to Spring Boot and Apache Spark.

ayudovin/YCSB 1

Yahoo! Cloud Serving Benchmark

Altoros/Spark-Optimization-Tutorial 0

Public repository for Spark-Optimization-Tutorial

ayudovin/clickhouse-jdbc 0

JDBC driver for ClickHouse

ayudovin/fakeit 0

Generates JSON documents based on models defined in YAML and adds them to a Couchbase Bucket

ayudovin/spark 0

Apache Spark

ayudovin/Spark-Optimization-Tutorial 0

Spark Optimization Tutorial

ayudovin/spring-boot 0

Spring Boot

ayudovin/spring-security 0

Spring Security

pull request comment apache/spark

[SPARK-30122][K8S] Support spark.kubernetes.authenticate.executor.serviceAccountName

BTW, thank you for sharing your use cases, @ecapoccia. Note that we don't backport new features. If we merge this, it will go into Apache Spark 3.1, because we are already in the Feature Freeze stage for 3.0.

What do you think about that, @ayudovin?

OK, that makes sense.

ayudovin

comment created time in 22 days

push event ayudovin/spark

yudovin

commit sha 3aaffa839b2853534dd59a9e7a2f856125b65539

Changing description

view details

push time in 25 days

push event ayudovin/spark

yudovin

commit sha 44cd90daf64cbcaeeaee81ee3762cb040c23153b

Cleaning code

view details

push time in 25 days

pull request comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

@ayudovin. The current status is that we are waiting for the original author of this framework, @hvanhovell. We need to check whether this was left out intentionally or not. Also, #27030 (comment) raises another concern about this approach, because it can be misleading.

Got it, that makes sense.

ayudovin

comment created time in a month

push event ayudovin/spark

yudovin

commit sha dca0658ec7bf948f92a4fbb92a5ed66ca7d6ae30

Adding fallback

view details

yudovin

commit sha ec8bed7fa51bfabdb6a14d7c52410880a4d1ad9a

Adding unit test

view details

push time in a month

push event ayudovin/spark

yudovin

commit sha 30686b4a19edf7ceb13cd1d86838d4a1dd5c861b

Updating unit test

view details

push time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = Map("key1" -> "1", "key2" -> "2"),
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = true)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)

I agree with you, it looks incorrect, but we cannot check whether the partition was actually created when ignoreIfExists=true is used.

ayudovin

comment created time in a month

pull request comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

Hi @dongjoon-hyun, do you still have any suggestions or comments on this pull request?

ayudovin

comment created time in a month

pull request comment apache/spark

[SPARK-30122][K8S] Allow setting serviceAccountName for executor pods

Hi, @ayudovin .

  1. In general, a new feature should not introduce a breaking change. For example, this PR additionally enforces the executor service account configuration. If the executor service account is not given, we had better fall back to the driver service account.
  2. Do you think we can have a test case for this? A test case is important to protect your contribution from future changes.

Unfortunately, I cannot find a way to add a unit test for it. Maybe you can give advice or point me to an example of tests for these properties.
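A minimal sketch of the fallback idea discussed above, assuming nothing beyond plain SparkConf lookups; the two configuration keys are the real Spark on Kubernetes properties, while the account names and the resolution logic are only illustrative and not the actual Spark implementation:

    import org.apache.spark.SparkConf

    // Hypothetical account names for illustration only.
    val conf = new SparkConf()
      .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver-sa")
      // The executor-specific property proposed in SPARK-30122 may or may not be set:
      // .set("spark.kubernetes.authenticate.executor.serviceAccountName", "spark-executor-sa")

    // Prefer the executor account; otherwise reuse the driver account instead of failing.
    val executorServiceAccount: Option[String] =
      conf.getOption("spark.kubernetes.authenticate.executor.serviceAccountName")
        .orElse(conf.getOption("spark.kubernetes.authenticate.driver.serviceAccountName"))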

ayudovin

comment created time in a month

pull request comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

Please don't forget the above comment.

Could you give me the real SQL statement for that?

Sorry, I have added a comment

ayudovin

comment created time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = CatalogTypes.emptyTablePartitionSpec,
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = true)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)
+
+    // RENAME
+    val newPartition = CatalogTablePartition(spec = Map("key1" -> "1", "key2" -> "2"),
+      storage = CatalogStorageFormat.empty)
+
+    catalog.renamePartitions("db5", "tbl1", Seq(partition.spec), Seq(newPartition.spec))

So, I could not find a real SQL statement for that; that's why I changed these unit tests.
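For context, here is a hedged example of the kind of SQL that exercises the create/rename/drop partition paths under discussion; the table name and partition values are hypothetical, the statements are issued through spark.sql from a spark-shell session, and a Hive-backed partitioned table is assumed:

    // Assumed table name `events`; partition columns mirror the key1/key2 spec used in the test.
    spark.sql("CREATE TABLE events (id BIGINT) PARTITIONED BY (key1 STRING, key2 STRING)")
    spark.sql("ALTER TABLE events ADD PARTITION (key1 = '1', key2 = '2')")
    spark.sql("ALTER TABLE events PARTITION (key1 = '1', key2 = '2') " +
      "RENAME TO PARTITION (key1 = '10', key2 = '20')")
    spark.sql("ALTER TABLE events DROP PARTITION (key1 = '10', key2 = '20')")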

ayudovin

comment created time in a month

PullRequestEvent

PR closed apache/spark

Reviewers
[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener SQL

<!-- Thanks for sending a pull request! Here are some tips for you:

  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review. -->

What changes were proposed in this pull request?

<!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.

  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link. --> Add events to CREATE, DROP, RENAME and ALTER partitions in ExternalCatalogWithListener.

Why are the changes needed?

<!-- Please clarify why the changes are needed. For instance,

  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug. --> These changes are needed to add hooks for partition operations in ExternalCatalogWithListener.

Does this PR introduce any user-facing change?

<!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If no, write 'No'. --> No

How was this patch tested?

<!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> This functionality is covered by unit tests.
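To illustrate the pattern behind the proposal (a sketch only: the event case classes below mirror what this PR proposes and are not upstream Spark API, and the wrapper class name is invented), the listener-aware catalog posts a pre-event, delegates the call, and then posts a post-event, in the same way the existing table and function events are emitted:

    import org.apache.spark.sql.catalyst.catalog.{CatalogTablePartition, ExternalCatalog}

    // Event shapes mirror what this PR proposes; they do not exist in upstream Spark.
    case class CreatePartitionPreEvent(database: String, name: String, parts: Seq[CatalogTablePartition])
    case class CreatePartitionEvent(database: String, name: String, parts: Seq[CatalogTablePartition])

    // Invented wrapper for illustration; the real change lives in ExternalCatalogWithListener.
    class PartitionEventCatalog(delegate: ExternalCatalog, post: Any => Unit) {
      def createPartitions(
          db: String,
          table: String,
          parts: Seq[CatalogTablePartition],
          ignoreIfExists: Boolean): Unit = {
        post(CreatePartitionPreEvent(db, table, parts))   // "pre" hook
        delegate.createPartitions(db, table, parts, ignoreIfExists)
        post(CreatePartitionEvent(db, table, parts))      // "post" hook
      }
    }

A listener registered on the catalog would then observe both events around every partition operation.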

+152 -0

16 comments

3 changed files

ayudovin

pr closed time in a month

pull request comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

Please don't forget the above comment.

Could you give me the real SQL statement for that?

ayudovin

comment created time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = CatalogTypes.emptyTablePartitionSpec,
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)

Hm, I cannot find an example; that's why I also think it doesn't happen in the real world.

ayudovin

comment created time in a month

pull request comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

Hi, @ayudovin . Please check the PR content once more carefully. Thanks!

I have updated the PR; could you please review it again?

ayudovin

comment created time in a month

push event ayudovin/spark

yudovin

commit sha 8da2c555d339e3a0442086dc8725f8b1aa31d0bc

Fixing comments and updating unit test

view details

push time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = CatalogTypes.emptyTablePartitionSpec,
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)

Despite this, I think that creating a partition with an empty spec should also produce the event.

ayudovin

comment created time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = CatalogTypes.emptyTablePartitionSpec,
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = true)
+    checkEvents(CreatePartitionPreEvent("db5", "tbl1", Seq(partition)) ::
+      CreatePartitionEvent("db5", "tbl1", Seq(partition)) :: Nil)
+
+    // RENAME
+    val newPartition = CatalogTablePartition(spec = Map("key1" -> "1", "key2" -> "2"),
+      storage = CatalogStorageFormat.empty)
+
+    catalog.renamePartitions("db5", "tbl1", Seq(partition.spec), Seq(newPartition.spec))

now it looks like key=0 to key1=1, key2=2

ayudovin

comment created time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    val path1 = Files.createTempDirectory("db_")
+    val path2 = Files.createTempDirectory(path1, "tbl_")
+    val uri1 = preparePath(path1)
+    val uri2 = preparePath(path2)
+
+    // CREATE
+    val dbDefinition = createDbDefinition(uri1)
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri2))
+
+    val tableDefinition = CatalogTable(
+      identifier = TableIdentifier("tbl1", Some("db5")),
+      tableType = CatalogTableType.MANAGED,
+      storage = storage,
+      schema = new StructType().add("id", "long"))
+
+    val partition = CatalogTablePartition(spec = CatalogTypes.emptyTablePartitionSpec,
+      storage = CatalogStorageFormat.empty)
+
+    catalog.createDatabase(dbDefinition, ignoreIfExists = false)
+    checkEvents(CreateDatabasePreEvent("db5") :: CreateDatabaseEvent("db5") :: Nil)
+
+    catalog.createTable(tableDefinition, ignoreIfExists = false)
+    checkEvents(CreateTablePreEvent("db5", "tbl1") :: CreateTableEvent("db5", "tbl1") :: Nil)
+
+    catalog.createPartitions("db5", "tbl1", Seq(partition), ignoreIfExists = false)

I agree with you, CatalogTypes.emptyTablePartitionSpec is not a good option for the unit test. I have changed it.

ayudovin

comment created time in a month

push event ayudovin/spark

yudovin

commit sha 43aeb9ba12c5dabcaddec3e84a2428ad2ebb09b6

Updating unit test

view details

push time in a month

push event ayudovin/spark

yudovin

commit sha e1220e105bc4304365ca3f232f1825b00bd6e623

Updating unit test

view details

push time in a month

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 class ExternalCatalogEventSuite extends SparkFunSuite {
     catalog.dropFunction("db5", "fn4")
     checkEvents(DropFunctionPreEvent("db5", "fn4") :: DropFunctionEvent("db5", "fn4") :: Nil)
   }
+
+  testWithCatalog("partition") { (catalog, checkEvents) =>
+    // CREATE
+    val path1 = Files.createTempDirectory("db_")
+    val uri1 = preparePath(path1)
+
+    val dbDefinition = createDbDefinition()
+
+    val storage = CatalogStorageFormat.empty.copy(
+      locationUri = Option(uri1))

yes, you are right

ayudovin

comment created time in a month

push event ayudovin/spark

zhanjf

commit sha c6ab7165dd11a0a7b8aea4c805409088e9a41a74

[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #26124 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 7bff2db9ed803e05a43c2d875c1dea819d81248a

[SPARK-21869][SS] Revise Kafka producer pool to implement 'expire' correctly This patch revises Kafka producer pool (cache) to implement 'expire' correctly. Current implementation of Kafka producer cache leverages Guava cache, which decides cached producer instance to be expired if the instance is not "accessed" from cache. The behavior defines expiration time as "last accessed time + timeout", which is incorrect because some task may use the instance longer than timeout. There's no concept of "returning" in Guava cache as well, so it cannot be fixed with Guava cache. This patch introduces a new pool implementation which tracks "reference count" of cached instance, and defines expiration time for the instance as "last returned time + timeout" if the reference count goes 0, otherwise Long.MaxValue (effectively no expire). Expiring instances will be done with evict thread explicitly instead of evicting in part of handling acquire. (It might bring more overhead, but it ensures clearing expired instances even the pool is idle.) This patch also creates a new package `producer` under `kafka010`, to hide the details from `kafka010` package. In point of `kafka010` package's view, only acquire()/release()/reset() are available in pool, and even for CachedKafkaProducer the package cannot close the producer directly. Explained above. Yes, but only for the way of expiring cached instances. (The difference is described above.) Each executor leveraging spark-sql-kafka would have one eviction thread. New and existing UTs. Closes #26845 from HeartSaVioR/SPARK-21869-revised. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 2164243526b5a2ebd1a77fe2aa43b7329b6a229c

[SPARK-28144][SPARK-29294][SS][FOLLOWUP] Use SystemTime defined in Kafka Time interface ### What changes were proposed in this pull request? This patch addresses review comments in #26960 (https://github.com/apache/spark/pull/26960#discussion_r360661930 / https://github.com/apache/spark/pull/26960#discussion_r360661947) which were not addressed in the patch. Addressing these review comments will let the code less dependent on actual implementation as it only relies on `Time` interface in Kafka. ### Why are the changes needed? These were review comments in previous PR and they bring actual benefit though they're minors. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26979 from HeartSaVioR/SPARK-29294-follow-up. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Maxim Gekk

commit sha ab0dd41ff2c05134266bd14d96676c96aa2764cd

[SPARK-26618][SQL][FOLLOWUP] Update the SQL migration guide regarding to typed `TIMESTAMP` and `DATE` literals ### What changes were proposed in this pull request? In the PR, I propose to update the SQL migration guide and clarify semantic of string conversion to typed `TIMESTAMP` and `DATE` literals. ### Why are the changes needed? This is a follow-up of the PR https://github.com/apache/spark/pull/23541 which changed the behavior of `TIMESTAMP`/`DATE` literals, and can impact on results of user's queries. ### Does this PR introduce any user-facing change? No ### How was this patch tested? It should be checked by jenkins build. Closes #26985 from MaxGekk/timestamp-date-constructors-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Wenchen Fan

commit sha ba3f6330dd2b6054988f1f6f0ffe014fc4969088

Revert "[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component" This reverts commit c6ab7165dd11a0a7b8aea4c805409088e9a41a74.

view details

Pavithra Ramachandran

commit sha 57ca95246cca69e4f6a847fff08005c921430ace

[SPARK-29505][SQL] Make DESC EXTENDED <table name> <column name> case insensitive ### What changes were proposed in this pull request? While querying using **desc** , if column name is not entered exactly as per the column name given during the table creation, the colstats are wrong. fetching of col stats has been made case insensitive. ### Why are the changes needed? functions like **analyze**, etc support case insensitive retrieval of column data. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? <!-- Unit test has been rewritten and tested. Closes #26927 from PavithraRamachandran/desc_caseinsensitive. Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

zhengruifeng

commit sha 5715a84c4059141e22d1e6981ab161aef5b38f00

[SPARK-29914][ML][FOLLOWUP] fix SQLTransformer & VectorSizeHint toString method ### What changes were proposed in this pull request? 1,modify the toString in SQLTransformer & VectorSizeHint 2,add toString in RegexTokenizer ### Why are the changes needed? in SQLTransformer & VectorSizeHint , `toString` methods directly call getter of param without default values. This will cause `java.util.NoSuchElementException` in REPL: ```scala scala> val vs = new VectorSizeHint() java.util.NoSuchElementException: Failed to find a default value for size at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:780) at scala.Option.getOrElse(Option.scala:189) ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #26999 from zhengruifeng/fix_toString. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

view details

zhengruifeng

commit sha 8f07839e743889bbffe91302f31388ab892bf08a

[SPARK-30178][ML] RobustScaler support large numFeatures ### What changes were proposed in this pull request? compute the medians/ranges more distributedly ### Why are the changes needed? It is a bottleneck to collect the whole Array[QuantileSummaries] from executors, since a QuantileSummaries is a large object, which maintains arrays of large sizes 10k(`defaultCompressThreshold`)/50k(`defaultHeadSize`). In Spark-Shell with default params, I processed a dataset with numFeatures=69,200, and existing impl fail due to OOM. After this PR, it will sucessfuly fit the model. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #26803 from zhengruifeng/robust_high_dim. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

view details

Liang-Chi Hsieh

commit sha 0042ad575a0907ba36ae560e4fb6f01ae071a3a3

[SPARK-30290][CORE] Count for merged block when fetching continuous blocks in batch ### What changes were proposed in this pull request? We added shuffle block fetch optimization in SPARK-9853. In ShuffleBlockFetcherIterator, we merge single blocks into batch blocks. During merging, we should count merged blocks for `maxBlocksInFlightPerAddress`, not original single blocks. ### Why are the changes needed? If `maxBlocksInFlightPerAddress` is specified, like set it to 1, it should mean one batch block, not one original single block. Otherwise, it will conflict with batch shuffle fetch. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #26930 from viirya/respect-max-blocks-inflight. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

manuzhang

commit sha ef6f9e966846a9885b048a68b1bbd37afb032aa6

[SPARK-30331][SQL] Set isFinalPlan to true before posting the final AdaptiveSparkPlan event ### What changes were proposed in this pull request? Set `isFinalPlan=true` before posting the final AdaptiveSparkPlan event (`SparkListenerSQLAdaptiveExecutionUpdate`) ### Why are the changes needed? Otherwise, any attempt to listen on the final event by pattern matching `isFinalPlan=true` would fail ### Does this PR introduce any user-facing change? No. ### How was this patch tested? All tests in `AdaptiveQueryExecSuite` are exteneded with a verification that a `SparkListenerSQLAdaptiveExecutionUpdate` event with `isFinalPlan=True` exists Closes #26983 from manuzhang/spark-30331. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

yi.wu

commit sha 35506dced739ef16136e9f3d5d48c638899d3cec

[SPARK-25855][CORE][FOLLOW-UP] Format config name to follow the other boolean conf naming convention ### What changes were proposed in this pull request? Change config name from `spark.eventLog.allowErasureCoding` to `spark.eventLog.allowErasureCoding.enabled`. ### Why are the changes needed? To follow the other boolean conf naming convention. ### Does this PR introduce any user-facing change? No, it's newly added in Spark 3.0. ### How was this patch tested? Tested manually and pass Jenkins. Closes #26998 from Ngone51/SPARK-25855-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Kent Yao

commit sha da65a955ed61a5f82181ea051959e91de884efcc

[SPARK-30266][SQL] Avoid match error and int overflow in ApproximatePercentile and Percentile ### What changes were proposed in this pull request? accuracyExpression can accept Long which may cause overflow error. accuracyExpression can accept fractions which are implicitly floored. accuracyExpression can accept null which is implicitly changed to 0. percentageExpression can accept null but cause MatchError. percentageExpression can accept ArrayType(_, nullable=true) in which the nulls are implicitly changed to zeros. ##### cases ```sql select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success select percentile_approx(10.0, 0.5, null); -- null cast to 0 select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1 select percentile_approx(10.0, null, 1); -- scala.MatchError select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero. ``` ##### behavior before ```sql +select percentile_approx(10.0, 0.5, 2147483648) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + +select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +10.0 + +select percentile_approx(10.0, null, 1) +scala.MatchError +null + + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +[10.0,10.0,10.0] ``` ##### behavior after ```sql +select percentile_approx(10.0, 0.5, 2147483648) +10.0 + +select percentile_approx(10.0, 0.5, 4294967297) +10.0 + +select percentile_approx(10.0, 0.5, null) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7 + +select percentile_approx(10.0, 0.5, 1.2) +org.apache.spark.sql.AnalysisException +cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7 + +select percentile_approx(10.0, null, 1) +java.lang.IllegalArgumentException +The value of percentage must be be between 0.0 and 1.0, but got null + +select percentile_approx(10.0, array(0.2, 0.4, null), 1) +java.lang.IllegalArgumentException +Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null] ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, fix some improper usages of percentile_approx as cases list above ### How was this patch tested? add ut Closes #26905 from yaooqinn/SPARK-30266. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

yi.wu

commit sha 6d64fc2407e5b21a2db59c5213df438c74a31637

[SPARK-26389][SS][FOLLOW-UP] Format config name to follow the other boolean conf naming convention ### What changes were proposed in this pull request? Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`. ### Why are the changes needed? To follow the other boolean conf naming convention. ### Does this PR introduce any user-facing change? No, as this config is newly added in 3.0. ### How was this patch tested? Pass Jenkins. Closes #26981 from Ngone51/SPARK-26389-FOLLOWUP. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

zhengruifeng

commit sha ad77b400da4089a2de74394e2b8aed813633025a

[SPARK-30347][ML] LibSVMDataSource attach AttributeGroup ### What changes were proposed in this pull request? LibSVMDataSource attach AttributeGroup ### Why are the changes needed? LibSVMDataSource will attach a special metadata to indicate numFeatures: ```scala scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt") scala> data.schema("features").metadata res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4} ``` However, all ML impls will try to obtain vector size via AttributeGroup, which can not use this metadata: ```scala scala> import org.apache.spark.ml.attribute._ import org.apache.spark.ml.attribute._ scala> AttributeGroup.fromStructField(data.schema("features")).size res1: Int = -1 ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? added tests Closes #27003 from zhengruifeng/libsvm_attr_group. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

view details

wenfang

commit sha 4d58cd77f99c7f83a4e8c94a36b6af5ef991cda0

[SPARK-30330][SQL] Support single quotes json parsing for get_json_object and json_tuple ### What changes were proposed in this pull request? I execute some query as` select get_json_object(ytag, '$.y1') AS y1 from t4`; SparkSQL return null but Hive return correct results. In my production environment, ytag is a json wrapped by single quotes,as follows ``` {'y1': 'shuma', 'y2': 'shuma:shouji'} {'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'} {'y1': 'yule', 'y2': 'yule:mingxing'} ``` Then l realized some functions including get_json_object and json_tuple does not support single quotes json parsing. So l provide this PR to resolve the question. ### Why are the changes needed? Enabled for Hive compatibility ### Does this PR introduce any user-facing change? NO ### How was this patch tested? NEW TESTS Closes #26965 from wenfang6/enableSingleQuotesJsonForSparkSQL. Authored-by: wenfang <wenfang@360.cn> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 481fb63f97d87d5b2e9e1f9b30bee466605b5a72

[MINOR][SQL][SS] Remove TODO comments as var in case class is discouraged but worth breaking it ### What changes were proposed in this pull request? This patch removes TODO comments which are left to address changing case classes having vars to normal classes in spark-sql-kafka module - the pattern is actually discouraged, but still worth to break it, as we already use automatic toString implementation and we may be using more. ### Why are the changes needed? Described above. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #26992 from HeartSaVioR/SPARK-30337. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

gengjiaan

commit sha d59e7195f6051820c95f84c83b01148975412d85

[SPARK-27986][SQL] Support ANSI SQL filter clause for aggregate expression ### What changes were proposed in this pull request? The filter predicate for aggregate expression is an `ANSI SQL`. ``` <aggregate function> ::= COUNT <left paren> <asterisk> <right paren> [ <filter clause> ] | <general set function> [ <filter clause> ] | <binary set function> [ <filter clause> ] | <ordered set function> [ <filter clause> ] | <array aggregate function> [ <filter clause> ] | <row pattern count function> [ <filter clause> ] ``` There are some mainstream database support this syntax. **PostgreSQL:** https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES For example: ``` SELECT year, count(*) FILTER (WHERE gdp_per_capita >= 40000) FROM countries GROUP BY year ``` ``` SELECT year, code, gdp_per_capita, count(*) FILTER (WHERE gdp_per_capita >= 40000) OVER (PARTITION BY year) FROM countries ``` **jOOQ:** https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/ **Notice:** 1.This PR only supports FILTER predicate without codegen. maropu will create another PR is related to SPARK-30027 to support codegen. 2.This PR only supports FILTER predicate without DISTINCT. I will create another PR is related to SPARK-30276 to support this. 3.This PR only supports FILTER predicate that can't reference the outer query. I created ticket SPARK-30219 to support it. 4.This PR only supports FILTER predicate that can't use IN/EXISTS predicate sub-queries. I created ticket SPARK-30220 to support it. 5.Spark SQL cannot supports a SQL with nested aggregate. I created ticket SPARK-30182 to support it. There are some show of the PR on my production environment. ``` spark-sql> desc gja_test_partition; key string NULL value string NULL other string NULL col2 int NULL # Partition Information # col_name data_type comment col2 int NULL Time taken: 0.79 s ``` ``` spark-sql> select * from gja_test_partition; a A ao 1 b B bo 1 c C co 1 d D do 1 e E eo 2 g G go 2 h H ho 2 j J jo 2 f F fo 3 k K ko 3 l L lo 4 i I io 4 Time taken: 1.75 s ``` ``` spark-sql> select count(key), sum(col2) from gja_test_partition; 12 26 Time taken: 1.848 s ``` ``` spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition; 8 Time taken: 2.926 s ``` ``` spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition; 14 Time taken: 2.087 s ``` ``` spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition; 8 14 Time taken: 2.847 s ``` ``` spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition; 12 8 26 14 Time taken: 1.787 s ``` ``` spark-sql> desc student; id int NULL name string NULL sex string NULL class_id int NULL Time taken: 0.206 s ``` ``` spark-sql> select * from student; 1 张三 man 1 2 李四 man 1 3 王五 man 2 4 赵六 man 2 5 钱小花 woman 1 6 赵九红 woman 2 7 郭丽丽 woman 2 Time taken: 0.786 s ``` ``` spark-sql> select class_id, count(id), sum(id) from student group by class_id; 1 3 8 2 4 20 Time taken: 18.783 s ``` ``` spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id; 1 2 5 2 2 13 Time taken: 3.887 s ``` ### Why are the changes needed? Add new SQL feature. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Exists UT and new UT. Closes #26656 from beliefer/support-aggregate-clause. 
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Kengo Seki

commit sha 59c014e120adbaa1c7f117c6fb436a523da37176

[SPARK-30350][SQL] Fix ScalaReflection to use an empty array for getting its class object ### What changes were proposed in this pull request? This PR fixes `ScalaReflection.arrayClassFor()` to use an empty array instead of a one-element array for getting its class object by reflection. ### Why are the changes needed? Because it may reduce unnecessary memory allocation. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran the existing unit tests for sql/catalyst and confirmed that all of them succeeded. Closes #27005 from sekikn/SPARK-30350. Authored-by: Kengo Seki <sekikn@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

Fu Chen

commit sha 3584d849438ad48ff54af3c982c124a8443dc590

[MINOR][CORE] Quiet request executor remove message ### What changes were proposed in this pull request? Settings to quiet for Class `ExecutorAllocationManager` that request message too verbose. otherwise, this class generates too many messages like `INFO spark.ExecutorAllocationManager: Request to remove executorIds: 890` when we enabled DRA. ### Why are the changes needed? Log level improvement. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Closes #26925 from cfmcgrady/quiet-request-executor-remove-message. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

zhanjf

commit sha 8d3eed33eee0d6e5d2e4d4907869ca6226d9e37f

[SPARK-29224][ML] Implement Factorization Machines as a ml-pipeline component ### What changes were proposed in this pull request? Implement Factorization Machines as a ml-pipeline component 1. loss function supports: logloss, mse 2. optimizer: GD, adamW ### Why are the changes needed? Factorization Machines is widely used in advertising and recommendation system to estimate CTR(click-through rate). Advertising and recommendation system usually has a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are common ml model to estimate CTR. References: 1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010. https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf ### Does this PR introduce any user-facing change? No ### How was this patch tested? run unit tests Closes #27000 from mob-ai/ml/fm. Authored-by: zhanjf <zhanjf@mob.com> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

push time in 2 months

push event ayudovin/spark

yudovin

commit sha 3cb79fe63a14ecba62bfc9b449ecf9b4a28b6e10

Putting extends PartitionEvent into a new line

view details

push time in 2 months

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 case class RenameFunctionEvent(
     newName: String)
   extends FunctionEvent
 
-trait PartitionEvent extends DatabaseEvent {
-  /**
-   * Name of the table that was touched.
-   */
-  val name: String
-}
-
+/**
+ * Event fired when a partition is created, dropped, altered or renamed.
+ */
+trait PartitionEvent extends TableEvent
 
 /**
  * Event fired before a partition is created.
  */
-case class CreatePartitionPreEvent(database: String, name: String,
-                                   parts: Seq[CatalogTablePartition]) extends PartitionEvent
+case class CreatePartitionPreEvent(
+    database: String,
+    name: String) extends PartitionEvent

Shall we put extends PartitionEvent on a new line?

Yes, I'll put it on a new line.

ayudovin

comment created time in 2 months

push event ayudovin/spark

yudovin

commit sha 0c5edc2314ef9d7c23d759735f44610f90655312

Applying changes by suggestions

view details

push time in 2 months

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 case class RenameFunctionEvent(
     newName: String)
   extends FunctionEvent
 
-trait PartitionEvent extends DatabaseEvent {
-  /**
-   * Name of the table that was touched.
-   */
-  val name: String
-}
-
+/**
+ * Event fired when a partition is created, dropped, altered or renamed.
+ */
+trait PartitionEvent extends TableEvent
 
 /**
  * Event fired before a partition is created.
  */
-case class CreatePartitionPreEvent(database: String, name: String,
-                                   parts: Seq[CatalogTablePartition]) extends PartitionEvent
+case class CreatePartitionPreEvent(
+    database: String,
+    name: String) extends PartitionEvent

Do you mean doing it this way:

 case class CreatePartitionPreEvent(
    database: String,
    name: String,
    parts: Seq[CatalogTablePartition]) extends PartitionEvent

but PartitionEvent remains empty?

ayudovin

comment created time in 2 months

push event ayudovin/spark

yudovin

commit sha af04a625698d7f6ea2390d8286bcca8db7fdef6f

Applying changes by suggestions

view details

push time in 2 months

Pull request review comment apache/spark

[SPARK-30244][SQL] Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

 case class RenameFunctionEvent(
     name: String,
     newName: String)
   extends FunctionEvent
+
+trait PartitionEvent extends DatabaseEvent {
+  /**
+   * Name of the table that was touched.
+   */
+  val name: String

Does it make sense to store parts as Seq[TablePartitionSpec]? CatalogTablePartition contains a TablePartitionSpec inside (see the sketch after this list).

  • Create and Alter partition take Seq[CatalogTablePartition] as a parameter
  • Rename and Drop partition take Seq[TablePartitionSpec] as a parameter
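A simplified paraphrase of the catalog methods in question, kept as a hypothetical trait (ExternalCatalogPartitionOps is an invented name and the argument lists are abbreviated, so this is not the literal ExternalCatalog definition), just to show the Seq[CatalogTablePartition] versus Seq[TablePartitionSpec] split:

    import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
    import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec

    trait ExternalCatalogPartitionOps {
      // Create/alter carry full partition metadata.
      def createPartitions(db: String, table: String,
          parts: Seq[CatalogTablePartition], ignoreIfExists: Boolean): Unit
      def alterPartitions(db: String, table: String,
          parts: Seq[CatalogTablePartition]): Unit

      // Rename/drop only identify partitions by their spec (a column -> value map).
      def renamePartitions(db: String, table: String,
          specs: Seq[TablePartitionSpec], newSpecs: Seq[TablePartitionSpec]): Unit
      def dropPartitions(db: String, table: String,
          specs: Seq[TablePartitionSpec], ignoreIfNotExists: Boolean): Unit
    }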
ayudovin

comment created time in 2 months

pull request comment apache/spark

[SPARK-30244][SQL][Catalyst] - Emit pre/post events for "Partition" methods in ExternalCatalogWithListener

@hvanhovell, Could you please review this pull request?

ayudovin

comment created time in 2 months

pull request comment apache/spark

[SPARK-30122][Resource-Manager][Kubernetes] - Allow setting serviceAccountName for executor pods

@liyinan926, Could you please review this pull request?

ayudovin

comment created time in 2 months

push event ayudovin/spark

yudovin

commit sha ba84321b31cc5c022022d5f3491d747939bc254a

Cleaning code

view details

yudovin

commit sha 25595b835e1e0a9004d81e40d13b00e35fc4b60a

Cleaning code

view details

push time in 2 months

PR opened apache/spark

[SPARK-30122] - Allow setting serviceAccountName for executor pods

<!-- Thanks for sending a pull request! Here are some tips for you:

  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review. -->

What changes were proposed in this pull request?

<!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.

  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link. --> Currently, it doesn't seem to be possible to have Spark Driver set the serviceAccountName for executor pods it launches.

Why are the changes needed?

<!-- Please clarify why the changes are needed. For instance,

  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug. --> It will allow setting serviceAccountName for executor pods.

Does this PR introduce any user-facing change?

<!-- If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible. If no, write 'No'. --> No

How was this patch tested?

<!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> It was tested manually.
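As a rough illustration of the intended end result (a sketch assuming the fabric8 Kubernetes client used by Spark's K8s module; the pod and account names are hypothetical and this is not the code from the PR), an executor pod spec would carry the configured service account:

    import io.fabric8.kubernetes.api.model.PodBuilder

    // Hypothetical names; the serviceAccountName field is the point of the sketch.
    val executorPod = new PodBuilder()
      .withNewMetadata()
        .withName("spark-exec-1")
        .endMetadata()
      .withNewSpec()
        .withServiceAccountName("spark-executor-sa")
        .endSpec()
      .build()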

+13 -5

0 comment

3 changed files

pr created time in 2 months

create branch ayudovin/spark

branch : srevice-account-name-for-executor-pods

created branch time in 2 months

push event ayudovin/spark

shahid

commit sha 91b83de4173839d559bee582f5569dc9857ab648

[SPARK-30086][SQL][TESTS] Run HiveThriftServer2ListenerSuite on a dedicated JVM to fix flakiness ### What changes were proposed in this pull request? This PR tries to fix flakiness in `HiveThriftServer2ListenerSuite` by using a dedicated JVM (after we switch to Hive 2.3 by default in PR builders). Likewise in https://github.com/apache/spark/commit/4a73bed3180aeb79c92bb19aea2ac5a97899731a, there's no explicit evidence for this fix. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114653/testReport/org.apache.spark.sql.hive.thriftserver.ui/HiveThriftServer2ListenerSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ ``` sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.LinkageError: loader constraint violation: loader (instance of net/bytebuddy/dynamic/loading/MultipleParentClassLoader) previously initiated loading for a different type with name "org/apache/hive/service/ServiceStateChangeListener" at org.mockito.codegen.HiveThriftServer2$MockitoMock$1974707245.<clinit>(Unknown Source) at sun.reflect.GeneratedSerializationConstructorAccessor164.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:48) at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) at org.mockito.internal.creation.instance.ObjenesisInstantiator.newInstance(ObjenesisInstantiator.java:19) at org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:47) at org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25) at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35) at org.mockito.internal.MockitoCore.mock(MockitoCore.java:62) at org.mockito.Mockito.mock(Mockito.java:1908) at org.mockito.Mockito.mock(Mockito.java:1880) at org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite.createAppStatusStore(HiveThriftServer2ListenerSuite.scala:156) at org.apache.spark.sql.hive.thriftserver.ui.HiveThriftServer2ListenerSuite.$anonfun$new$3(HiveThriftServer2ListenerSuite.scala:47) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) ``` ### Why are the changes needed? To make test cases more robust. ### Does this PR introduce any user-facing change? No (dev only). ### How was this patch tested? Jenkins build. Closes #26720 from shahidki31/mock. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Liu,Linhong

commit sha f22177c9576dd148d958383759f9e812413e455d

[SPARK-29486][SQL][FOLLOWUP] Document the reason to add days field ### What changes were proposed in this pull request? Follow up of #26134 to document the reason to add days filed and explain how do we use it ### Why are the changes needed? only comment ### Does this PR introduce any user-facing change? no ### How was this patch tested? no need test Closes #26701 from LinhongLiu/spark-29486-followup. Authored-by: Liu,Linhong <liulinhong@baidu.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

huangtianhua

commit sha 700a2edbd10a7b64ed92895b2ce85e6687136e7b

[SPARK-30057][DOCS] Add a statement of platforms Spark runs on Closes #26690 from huangtianhua/add-note-spark-runs-on-arm64. Authored-by: huangtianhua <huangtianhua@huawei.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

HyukjinKwon

commit sha f32ca4b2799d028620d010f8c9c4f7c09f4ad864

[SPARK-30076][BUILD][TESTS] Upgrade Mockito to 3.1.0 ### What changes were proposed in this pull request? We used 2.28.2 of Mockito as of https://github.com/apache/spark/pull/25139 because 3.0.0 might be unstable. Now 3.1.0 is released. See release notes - https://github.com/mockito/mockito/blob/v3.1.0/doc/release-notes/official.md ### Why are the changes needed? To bring the fixes made in the dependency. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Jenkins will test. Closes #26707 from HyukjinKwon/upgrade-Mockito. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Yuming Wang

commit sha 708ab57f377bfd8e71183cfead918bae5b811946

[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. **Spark SQL**: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ | CAST(1 AS DECIMAL(38,18)) | +----------------------------+--+ | 1.000000000000000000 | +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ |1.000000000000000000 | +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ ``` **PostgreSQL**: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` **Presto**: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #26697 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

wuyi

commit sha 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f

[SPARK-29956][SQL] A literal number with an exponent should be parsed to Double ### What changes were proposed in this pull request? For a literal number with an exponent(e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And user could still use `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to previous behavior. ### Why are the changes needed? According to ANSI standard of SQL, we see that the (part of) definition of `literal` : ``` <approximate numeric literal> ::= <mantissa> E <exponent> ``` which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal). And when we test Presto, we found that Presto also conforms to this standard: ``` presto:default> select typeof(1E2); _col0 -------- double (1 row) ``` ``` presto:default> select typeof(1.2); _col0 -------------- decimal(2,1) (1 row) ``` We also find that, actually, literals like `1E2` are parsed as Double before Spark2.1, but changed to Decimal after #14828 due to *The difference between the two confuses most users* as it said. But we also see support(from DB2 test) of original behavior at #14828 (comment). Although, we also see that PostgreSQL has its own implementation: ``` postgres=# select pg_typeof(1E2); pg_typeof ----------- numeric (1 row) postgres=# select pg_typeof(1.2); pg_typeof ----------- numeric (1 row) ``` We still think that Spark should also conform to this standard while considering SQL standard and Spark own history and majority DBMS and also user experience. ### Does this PR introduce any user-facing change? Yes. For `1E2`, before this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)] ``` After this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [100.0: double] ``` And for `1E-45`, before this PR: ``` org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 == SQL == select 1E-45 at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided ``` after this PR: ``` scala> spark.sql("select 1E-45"); res1: org.apache.spark.sql.DataFrame = [1.0E-45: double] ``` And before this PR, user may feel super weird to see that `select 1e40` works but `select 1e-40 fails`. And now, both of them work well. ### How was this patch tested? updated `literals.sql.out` and `ansi/literals.sql.out` Closes #26595 from Ngone51/SPARK-29956. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Terry Kim

commit sha 5a1896adcb87e1611559c55fc76f32063e1c7c1b

[SPARK-30065][SQL] DataFrameNaFunctions.drop should handle duplicate columns ### What changes were proposed in this pull request? `DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified. ```Scala val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") val df = left.join(right, Seq("col1")) df.printSchema df.na.drop("any").show ``` produces ``` root |-- col1: string (nullable = true) |-- col2: string (nullable = true) |-- col2: string (nullable = true) org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.; at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) ``` The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity. This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity). ### Why are the changes needed? If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns. ### Does this PR introduce any user-facing change? Yes, now all the rows with nulls are dropped in the above example: ``` scala> df.na.drop("any").show +----+----+----+ |col1|col2|col2| +----+----+----+ +----+----+----+ ``` ### How was this patch tested? Added new unit tests. Closes #26700 from imback82/na_drop. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Yuanjian Li

commit sha d1465a1b0dea690fcfbf75edb73ff9f8a015c0dd

[SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled ### What changes were proposed in this pull request? 1. Make maxNumPostShufflePartitions config obey reducePostShufflePartitions config. 2. Update the description for all the SQLConf affected by `spark.sql.adaptive.enabled`. ### Why are the changes needed? Make the relation between these confs clearer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26664 from xuanyuanking/SPARK-9853-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha 51e69feb495dfc63023ff673da30a3198081cfb6

[SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map ### What changes were proposed in this pull request? This PR proposes to use foreach instead of misusing map as a small followup of #26476. This could cause some weird errors potentially and it's not a good practice anyway. See also SPARK-16694 ### Why are the changes needed? To avoid potential issues like SPARK-16694 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover. Closes #26729 from HyukjinKwon/SPARK-29851. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
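
As an aside, a minimal Scala sketch (illustration only, not code from the patch) of the pattern this follow-up cleans up: using map purely for side effects builds and throws away a collection, while foreach makes the intent explicit.

```scala
val names = Seq("a", "b", "c")

// Misuse: map is run only for its side effect, and the resulting Seq[Unit] is discarded.
names.map(n => println(n))

// Preferred: foreach signals that only the side effect matters and allocates no result.
names.foreach(n => println(n))
```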

view details

Liang-Chi Hsieh

commit sha 85cb388ae3f25b0e6a7fc1a2d78fd1c3ec03f341

[SPARK-30050][SQL] analyze table and rename table should not erase hive table bucketing info ### What changes were proposed in this pull request? This patch adds Hive provider into table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase if it can not find hive provider in given table metadata. Rename table also has this issue. ### Why are the changes needed? Because running `ANALYZE TABLE` on a Hive table, if the table has bucketing info, will erase existing bucket info. ### Does this PR introduce any user-facing change? Yes. After this PR, running `ANALYZE TABLE` on Hive table, won't erase existing bucketing info. ### How was this patch tested? Unit test. Closes #26685 from viirya/fix-hive-bucket. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

zhengruifeng

commit sha 03ac1b799cf1e48489e8246a1b97110c80344160

[SPARK-29959][ML][PYSPARK] Summarizer support more metrics ### What changes were proposed in this pull request? Summarizer support more metrics: sum, std ### Why are the changes needed? Those metrics are widely used, it will be convenient to directly obtain them other than a conversion. in `NaiveBayes`: we want the sum of vectors, mean & weightSum need to computed then multiplied in `StandardScaler`,`AFTSurvivalRegression`,`LinearRegression`,`LinearSVC`,`LogisticRegression`: we need to obtain `variance` and then sqrt it to get std ### Does this PR introduce any user-facing change? yes, new metrics are exposed to end users ### How was this patch tested? added testsuites Closes #26596 from zhengruifeng/summarizer_add_metrics. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
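
For illustration, a hedged Scala sketch of requesting the new metrics through ml.stat.Summarizer, assuming the metric names land as "sum" and "std" as described above; the data values are made up.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("summarizer-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 4.0), 1.0)
).toDF("features", "weight")

// Request sum and std directly instead of deriving them from mean/weightSum/variance.
df.select(Summarizer.metrics("sum", "std").summary($"features", $"weight").as("stats"))
  .show(false)
```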

view details

Yuanjian Li

commit sha 169415ffac3050a86934011525ea00eef7fca35c

[SPARK-30025][CORE] Continuous shuffle block fetching should be disabled by default when the old fetch protocol is used ### What changes were proposed in this pull request? Disable continuous shuffle block fetching when the old fetch protocol in use. ### Why are the changes needed? The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`. ### Does this PR introduce any user-facing change? Users will not get the exception related to continuous shuffle block fetching when old version of the external shuffle service is used. ### How was this patch tested? Existing UT. Closes #26663 from xuanyuanking/SPARK-30025. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

LantaoJin

commit sha 04a5b8f5f80ee746bdc16267e44a993a9941d335

[SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE ### What changes were proposed in this pull request? In SPARK-29421 (#26097) , we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`. Hive support `STORED AS` new file format syntax: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`. ### Why are the changes needed? See https://github.com/apache/spark/pull/26097#issue-327424759 ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat]; ### How was this patch tested? Add UTs. Closes #26466 from LantaoJin/SPARK-29839. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Kent Yao

commit sha 4e073f3c5093e136518e456d0a3a7437ad9867a3

[SPARK-30047][SQL] Support interval types in UnsafeRow ### What changes were proposed in this pull request? Optimize aggregates on interval values from sort-based to hash-based, and we can use the `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance. ### Why are the changes needed? improve aggerates ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut and existing ones Closes #26680 from yaooqinn/SPARK-30047. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Wenchen Fan

commit sha e271664a01fd7dee63391890514d76262cad1bc1

[MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled ### What changes were proposed in this pull request? add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`. ### Why are the changes needed? to follow the existing naming style ### Does this PR introduce any user-facing change? no ### How was this patch tested? not needed Closes #26694 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 54edaee58654bdc3c961906a8390088f35460ae9

[MINOR][SS] Add implementation note on overriding serialize/deserialize in HDFSMetadataLog methods' scaladoc ### What changes were proposed in this pull request? The patch adds scaladoc on `HDFSMetadataLog.serialize` and `HDFSMetadataLog.deserialize` for adding implementation note when overriding - HDFSMetadataLog calls `serialize` and `deserialize` inside try-finally and caller will do the resource (input stream, output stream) cleanup, so resource cleanup should not be performed in these methods, but there's no note on this (only code comment, not scaladoc) which is easy to be missed. ### Why are the changes needed? Contributors who are unfamiliar with the intention seem to think it as a bug if the resource is not cleaned up in serialize/deserialize of subclass of HDFSMetadataLog, and they couldn't know about the intention without reading the code of HDFSMetadataLog. Adding the note as scaladoc would expand the visibility. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Just a doc change. Closes #26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: dz <953396112@qq.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

huangtianhua

commit sha e842033accf12190f1bf3962546065613656410f

[SPARK-27721][BUILD] Switch to use right leveldbjni according to the platforms This change adds a profile to switch to the right leveldbjni package according to the platform: aarch64 uses org.openlabtesting.leveldbjni:leveldbjni-all.1.8, and other platforms use the old one, org.fusesource.leveldbjni:leveldbjni-all.1.8. Some Hadoop dependency packages also depend on org.fusesource.leveldbjni:leveldbjni-all, but Hadoop merged a similar change on trunk (details: https://issues.apache.org/jira/browse/HADOOP-16614), so the org.fusesource.leveldbjni dependency is excluded for these Hadoop packages. Then Spark can build/test on the aarch64 platform successfully. Closes #26636 from huangtianhua/add-aarch64-leveldbjni. Authored-by: huangtianhua <huangtianhua@huawei.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Huaxin Gao

commit sha babefdee1c133c6b35ff026d5deacb292a0b85aa

[SPARK-30085][SQL][DOC] Standardize sql reference ### What changes were proposed in this pull request? Standardize sql reference ### Why are the changes needed? To have consistent docs ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using jykyll build --serve Closes #26721 from huaxingao/spark-30085. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha e04a63437b8f31db90ca1669ee98289f4ba633e1

[SPARK-30075][CORE][TESTS] Fix the hashCode implementation of ArrayKeyIndexType correctly ### What changes were proposed in this pull request? This patch fixes the bug on ArrayKeyIndexType.hashCode() as it is simply calling Array.hashCode() which in turn calls Object.hashCode(). That should be Arrays.hashCode() to reflect the elements in the array. ### Why are the changes needed? I've encountered the bug in #25811 while adding test codes for #25811, and I've split the fix into individual PR to speed up reviewing. Without this patch, ArrayKeyIndexType would bring various issues when it's used as type of collections. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? I've skipped adding UT as ArrayKeyIndexType is in test and the patch is pretty simple one-liner. Closes #26709 from HeartSaVioR/SPARK-30075. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
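
For context, a small Scala illustration (not the test class itself) of why Array.hashCode() is the wrong choice for a key type:

```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Array.hashCode() falls through to Object.hashCode(), so arrays with equal
// contents hash differently and break hash-based collections keyed on them.
println(a.hashCode() == b.hashCode())                                  // false in practice (identity-based)

// java.util.Arrays.hashCode reflects the elements, which is what a key type needs.
println(java.util.Arrays.hashCode(a) == java.util.Arrays.hashCode(b)) // true
```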

view details

Ali Afroozeh

commit sha 68034a805607ced50dbedca73dfc7eaf0102dde8

[SPARK-30072][SQL] Create dedicated planner for subqueries ### What changes were proposed in this pull request? This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subqueries plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent. As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes: ``` promote_precision(1.7) * promote_precision(avg(a)) ``` After the optimization, more specifically the constant folding rule, this expression becomes: ``` 1.7 * promote_precision(avg(a)) ``` Now if we run the analyzer on this optimized query again, we will get: ``` promote_precision(1.7) * promote_precision(promote_precision(avg(a))) ``` Which will later optimized as: ``` 1.7 * promote_precision(promote_precision(avg(a))) ``` As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_preceision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan. We opted to introduce dedicated planners for subuqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced. ### How was this patch tested? Unit tests Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>

view details

push time in 2 months

PR opened apache/spark

[SPARK-30244] - Emit pre/post events for "Partition" methods in ExternalCatalogWithListener


What changes were proposed in this pull request?

Add events for the CREATE, DROP, RENAME, and ALTER partition methods in ExternalCatalogWithListener.
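
For illustration, a hedged sketch of how such events could be consumed, assuming the proposed partition events (e.g. CreatePartitionEvent) are forwarded to the Spark listener bus like the existing catalog events; this is not code from the PR, and the class/package names follow the existing catalog event conventions.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CreatePartitionEvent

val spark = SparkSession.builder().master("local[*]").appName("catalog-events").getOrCreate()

// Catalog events are posted to the listener bus, so a plain SparkListener can
// observe them through onOtherEvent.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: CreatePartitionEvent => println(s"partitions created: $e")
    case _ => // ignore everything else
  }
})
```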

Why are the changes needed?

These changes are needed to add pre/post hooks for partition operations in ExternalCatalogWithListener.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This functionality is covered by unit tests.

+129 -0

0 comment

3 changed files

pr created time in 2 months

create branch ayudovin/spark

branch : partition-events

created branch time in 2 months

Pull request review comment apache/spark

[SPARK-27523][CORE] - Resolve scheme-less event log directory relative to default filesystem

```diff
 class SparkContext(config: SparkConf) extends Logging {
       files.foreach(addFile)
     }

+    _eventLogDir =
+      if (isEventLogEnabled) {
+        val defaultFSProperty = _hadoopConfiguration.get("fs.defaultFS")
+        val defaultFS = if (defaultFSProperty == null) "" else defaultFSProperty
+
+        val unresolvedDir = s"$defaultFS${conf.get(EVENT_LOG_DIR).stripSuffix("/")}"
```

I agree with you; I have made the changes per your suggestions.

ayudovin

comment created time in 3 months

push event ayudovin/spark

yudovin

commit sha 622f90ddb4758633032492620bdfe57687352785

Adding unit test and handling the case when a scheme is specified in eventLog.dir

view details

push time in 3 months

PR opened apache/spark

[SPARK-27523] - Resolve scheme-less event log directory relative to default filesystem


What changes were proposed in this pull request?

The event log directory (spark.eventLog.dir) is resolved with respect to the default filesystem.

Why are the changes needed?

These changes are needed to resolve spark.eventLog.dir correctly when fs.defaultFS is specified.
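
A short configuration sketch of the scenario, with made-up values for the filesystem and path:

```scala
import org.apache.spark.SparkConf

// Hypothetical settings illustrating the case this PR addresses: a scheme-less
// event log directory that should resolve against fs.defaultFS.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/spark-history")                // no scheme
  .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")   // default filesystem

// With this change, the effective event log location becomes
// hdfs://namenode:8020/spark-history rather than a path on the local filesystem.
```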

Does this PR introduce any user-facing change?

No

How was this patch tested?

This patch was tested manually.

+11 -8

0 comment

1 changed file

pr created time in 3 months

create branch ayudovin/spark

branch : log-directory

created branch time in 3 months

push event ayudovin/spark

ulysses

commit sha 8a8ac002716d000431b4c8ff582fc1f56740aecc

[SPARK-29687][SQL] Fix JDBC metrics counter data type ### What changes were proposed in this pull request? Fix JDBC metrics counter data type. Related pull request [26109](https://github.com/apache/spark/pull/26109). ### Why are the changes needed? Avoid overflow. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Exists UT. Closes #26346 from ulysses-you/SPARK-29687. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

Terry Kim

commit sha 3175f4bf1be9a847675b9123c6ee505bd446b419

[SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala ### What changes were proposed in this pull request? This PR changes the behavior of `Column.getItem` to call `Column.getItem` on Scala side instead of `Column.apply`. ### Why are the changes needed? The current behavior is not consistent with that of Scala. In PySpark: ```Python df = spark.range(2) map_col = create_map(lit(0), lit(100), lit(1), lit(200)) df.withColumn("mapped", map_col.getItem(col('id'))).show() # +---+------+ # | id|mapped| # +---+------+ # | 0| 100| # | 1| 200| # +---+------+ ``` In Scala: ```Scala val df = spark.range(2) val map_col = map(lit(0), lit(100), lit(1), lit(200)) // The following getItem results in the following exception, which is the right behavior: // java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Column id // at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) // at org.apache.spark.sql.Column.getItem(Column.scala:856) // ... 49 elided df.withColumn("mapped", map_col.getItem(col("id"))).show ``` ### Does this PR introduce any user-facing change? Yes. If the use wants to pass `Column` object to `getItem`, he/she now needs to use the indexing operator to achieve the previous behavior. ```Python df = spark.range(2) map_col = create_map(lit(0), lit(100), lit(1), lit(200)) df.withColumn("mapped", map_col[col('id'))].show() # +---+------+ # | id|mapped| # +---+------+ # | 0| 100| # | 1| 200| # +---+------+ ``` ### How was this patch tested? Existing tests. Closes #26351 from imback82/spark-29664. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Huaxin Gao

commit sha ae7450d1c977bcfde1d67745440079abbeb19c67

[SPARK-29676][SQL] ALTER TABLE (RENAME PARTITION) should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add AlterTableRenamePartitionStatement and make ALTER TABLE ... RENAME TO PARTITION go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2) // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running ALTER TABLE ... RENAME TO PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26350 from huaxingao/spark_29676. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>

view details

zhengruifeng

commit sha 8a4378c6f0ca49f94fdeba08ca095721427e8635

[SPARK-29686][ML] LinearSVC should persist instances if needed ### What changes were proposed in this pull request? persist the input if needed ### Why are the changes needed? training with non-cached dataset will hurt performance ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #26344 from zhengruifeng/linear_svc_cache. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>

view details

Liu,Linhong

commit sha a4382f7fe1c36a51c64f460c6cb91e93470e0825

[SPARK-29486][SQL] CalendarInterval should have 3 fields: months, days and microseconds ### What changes were proposed in this pull request? Current CalendarInterval has 2 fields: months and microseconds. This PR try to change it to 3 fields: months, days and microseconds. This is because one logical day interval may have different number of microseconds (daylight saving). ### Why are the changes needed? One logical day interval may have different number of microseconds (daylight saving). For example, in PST timezone, there will be 25 hours from 2019-11-2 12:00:00 to 2019-11-3 12:00:00 ### Does this PR introduce any user-facing change? no ### How was this patch tested? unit test and new added test cases Closes #26134 from LinhongLiu/calendarinterval. Authored-by: Liu,Linhong <liulinhong@baidu.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Huaxin Gao

commit sha 14337f68e328c3faad81d3051a56be080a31509d

[SPARK-29643][SQL] ALTER TABLE/VIEW (DROP PARTITION) should look up catalog/table like v2 commands ###What changes were proposed in this pull request? Add AlterTableDropPartitionStatement and make ALTER TABLE/VIEW ... DROP PARTITION go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog ALTER TABLE t DROP PARTITION (id=1) // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running ALTER TABLE/VIEW ... DROP PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? Unit tests. Closes #26303 from huaxingao/spark-29643. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

DylanGuedes

commit sha f53be0a05e04c83a31420e5b3feaf4604f67481c

[SPARK-29109][SQL][TESTS] Port window.sql (Part 3) ### What changes were proposed in this pull request? This PR ports window.sql from PostgreSQL regression tests https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql#L564-L911 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out ### Why are the changes needed? To ensure compatibility with PostgreSQL. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the Jenkins. And, Comparison with PgSQL results. Closes #26274 from DylanGuedes/spark-29109. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

root1

commit sha 39fff9258af846cca2dd82816f0e2cb7d416a2c4

[SPARK-29452][WEBUI] Improve Storage tab tooltip ### What changes were proposed in this pull request? Added tooltips for each column in the Storage tab of the Web UI. ### Why are the changes needed? Tooltips will help users understand the columns of the Storage tab. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Manually Tested. Closes #26226 from iRakson/storage_tooltip. Authored-by: root1 <raksonrakesh@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Matt Stillwell

commit sha 1e1b7302f482a3b81e1fcd7060b4849a488376bf

[MINOR][PYSPARK][DOCS] Fix typo in example documentation ### What changes were proposed in this pull request? I propose that we change the example code documentation to call the proper function . For example, under the `foreachBatch` function, the example code was calling the `foreach()` function by mistake. ### Why are the changes needed? I suppose it could confuse some people, and it is a typo ### Does this PR introduce any user-facing change? No, there is no "meaningful" code being change, simply the documentation ### How was this patch tested? I made the change on a fork and it still worked Closes #26299 from mstill3/patch-1. Authored-by: Matt Stillwell <18670089+mstill3@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

dengziming

commit sha 28ccd31aee7e1d01ecf50cf5adbd88f0af79fdce

[SPARK-29611][WEBUI] Sort Kafka metadata by the number of messages ### What changes were proposed in this pull request? Sort metadata by the number of messages in each Kafka partition ### Why are the changes needed? help to find the data skewness problem. ### Does this PR introduce any user-facing change? Yes, add a column count to the metadata and sort by count ![image](https://user-images.githubusercontent.com/26023240/67617886-63e06800-f81a-11e9-8718-be3a0100952e.png) If you set `minPartitions` configurations with structure structured-streaming which doesn't have the Streaming page, my code changes in `DirectKafkaInputDStream` won't affect the WEB UI page just as it shows in the follow image ![image](https://user-images.githubusercontent.com/26023240/68020762-79520800-fcda-11e9-96cd-f0c64a36f505.png) ### How was this patch tested? Manual test Closes #26266 from dengziming/feature_ui_optimize. Lead-authored-by: dengziming <dengziming@growingio.com> Co-authored-by: dengziming <swzmdeng@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Wenchen Fan

commit sha 31ae446e9c0be4dff2b75e510a2e1b65773d757e

[SPARK-29623][SQL] do not allow multiple unit TO unit statements in interval literal syntax ### What changes were proposed in this pull request? re-arrange the parser rules to make it clear that multiple unit TO unit statement like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH` is not allowed. ### Why are the changes needed? This is definitely an accident that we support such a weird syntax in the past. It's not supported by any other DBs and I can't think of any use case of it. Also no test covers this syntax in the current codebase. ### Does this PR introduce any user-facing change? Yes, and a migration guide item is added. ### How was this patch tested? new tests. Closes #26285 from cloud-fan/syntax. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Eric Meisel

commit sha be022d9aeeb6b39f7b51427d848600e15a100c30

[SPARK-29677][DSTREAMS] amazon-kinesis-client 1.12.0 ### What changes were proposed in this pull request? Upgrading the amazon-kinesis-client dependency to 1.12.0. ### Why are the changes needed? The current amazon-kinesis-client version is 1.8.10. This version depends on the use of `describeStream`, which has a hard limit on an AWS account (10 reqs / second). Versions 1.9.0 and up leverage `listShards`, which has no such limit. For large customers, this can be a major problem. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #26333 from etspaceman/kclUpgrade. Authored-by: Eric Meisel <eric.steven.meisel@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Dongjoon Hyun

commit sha 91d990162f13acde546d01e1163ed3e898cbf9a7

[SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh ### What changes were proposed in this pull request? This PR aims to remove `check-cran` from `run-tests.sh`. We had better add an independent Jenkins job to run `check-cran`. ### Why are the changes needed? CRAN instability has been a blocker for our daily dev process. The following simple check causes consecutive failures in 4 of 9 Jenkins jobs + PR builder. ``` * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` - spark-branch-2.4-test-sbt-hadoop-2.6 - spark-branch-2.4-test-sbt-hadoop-2.7 - spark-master-test-sbt-hadoop-2.7 - spark-master-test-sbt-hadoop-3.2 - PRBuilder ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Currently, PR builder is failing due to the above issue. This PR should pass the Jenkins. Closes #26375 from dongjoon-hyun/SPARK-24152. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Dongjoon Hyun

commit sha 1ac6bd9f7929e7b907a0c2262e42b4d8a24ebbaa

[SPARK-29729][BUILD] Upgrade ASM to 7.2 ### What changes were proposed in this pull request? This PR aims to upgrade ASM to 7.2. - https://issues.apache.org/jira/browse/XBEAN-322 (Upgrade to ASM 7.2) - https://asm.ow2.io/versions.html ### Why are the changes needed? This will bring the following patches. - 317875: Infinite loop when parsing invalid method descriptor - 317873: Add support for RET instruction in AdviceAdapter - 317872: Throw an exception if visitFrame used incorrectly - add support for Java 14 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing UTs. Closes #26373 from dongjoon-hyun/SPARK-29729. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Dongjoon Hyun

commit sha 4bcfe5033c0523b182dd9ef72583f842cd386c48

[SPARK-29731][INFRA] Use public JIRA REST API to read-only access ### What changes were proposed in this pull request? This PR replaces `jira_client` API call for read-only access with public Apache JIRA REST API invocation. ### Why are the changes needed? This will reduce the number of authenticated API invocations. I hope this will reduce the chance of CAPCHAR from Apache JIRA site. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual. ``` $ echo 26375 > .github-jira-max $ dev/github_jira_sync.py Read largest PR number previously seen: 26375 Retrieved 100 JIRA PR's from Github 1 PR's remain after excluding visted ones Checking issue SPARK-29731 Writing largest PR number seen: 26376 Build PR dictionary SPARK-29731 26376 Set 26376 with labels "PROJECT INFRA" ``` Closes #26376 from dongjoon-hyun/SPARK-29731. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Maxim Gekk

commit sha 80a89873b20aa07e2522bed5da0fc50e616246d9

[SPARK-29733][TESTS] Fix wrong order of parameters passed to `assertEquals` ### What changes were proposed in this pull request? The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter. ### Why are the changes needed? Wrong order of assert parameters confuses when the assert fails and the parameters have special string representation. For example: ```java assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L)); ``` ``` java.lang.AssertionError: Expected :interval 5 months 5 days 101 hours Actual :interval 5 months 5 days 102 hours ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? By existing tests. Closes #26377 from MaxGekk/fix-order-in-assert-equals. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Sean Owen

commit sha 19b8c714369802d9057a98076edea94dd0cc3e9c

[SPARK-29674][CORE] Update dropwizard metrics to 4.1.x for JDK 9+ ### What changes were proposed in this pull request? Update the version of dropwizard metrics that Spark uses for metrics to 4.1.x, from 3.2.x. ### Why are the changes needed? This helps JDK 9+ support, per for example https://github.com/dropwizard/metrics/pull/1236 ### Does this PR introduce any user-facing change? No, although downstream users with custom metrics may be affected. ### How was this patch tested? Existing tests. Closes #26332 from srowen/SPARK-29674. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Liang-Chi Hsieh

commit sha e7263242bd9b9b2207147af4ab3ae4ec2ff3c718

Revert "[SPARK-24152][R][TESTS] Disable check-cran from run-tests.sh" ### What changes were proposed in this pull request? This reverts commit 91d990162f13acde546d01e1163ed3e898cbf9a7. ### Why are the changes needed? CRAN check is pretty important for R package, we should enable it. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit tests. Closes #26381 from viirya/revert-SPARK-24152. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

view details

Kent Yao

commit sha 5ba17d09acf67e0bcd66a770f6b4e1abe7d52ecf

[SPARK-29722][SQL] Non-reserved keywords should be able to be used in higher-order functions ### What changes were proposed in this pull request? Support non-reserved keywords to be used in higher-order functions. ### Why are the changes needed? The keywords are non-reserved. ### Does this PR introduce any user-facing change? Yes, all non-reserved keywords can be used in higher-order functions correctly. ### How was this patch tested? Added UTs. Closes #26366 from yaooqinn/SPARK-29722. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

wangguangxin.cn

commit sha 83c39d15e1206d404ae51390ba7b6937655b1980

[SPARK-29343][SQL] Eliminate sorts without limit in the subquery of Join/Aggregation ### What changes were proposed in this pull request? This is somewhat a complement of https://github.com/apache/spark/pull/21853. The `Sort` without `Limit` operator in `Join` subquery is useless, it's the same case in `GroupBy` when the aggregation function is order irrelevant, such as `count`, `sum`. This PR try to remove this kind of `Sort` operator in `SQL Optimizer`. ### Why are the changes needed? For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`. 'select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b' is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`. Remove useless `Sort` operator can improve performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Adding new UT `RemoveSortInSubquerySuite.scala` Closes #26011 from WangGuangxin/remove_sorts. Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

push time in 3 months

pull request comment apache/spark

[SPARK-27558][CORE] Gracefully cleanup task when it fails with OOM exception

@gatorsmile, I have created the PR.

ayudovin

comment created time in 3 months

PR opened apache/spark

[SPARK-27558][CORE] Gracefully cleanup task when it fails with OOM exception


What changes were proposed in this pull request?

When a task fails with an OOM exception, UnsafeInMemorySorter.array can be null. Meanwhile, cleanupResources() on task completion calls UnsafeInMemorySorter.getMemoryUsage in turn, which leads to another NPE being thrown.

Why are the changes needed?

Checking whether array is null in UnsafeInMemorySorter.getMemoryUsage should help to avoid the NPE.
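
The gist of the guard, shown as a hedged Scala sketch (the real class is Java and its details differ):

```scala
// Simplified stand-in for the sorter's memory accounting; `array` may already have
// been nulled out by the OOM handling path before cleanupResources() runs.
def getMemoryUsage(array: Array[Long]): Long = {
  if (array == null) {
    0L                  // nothing is held any more, so report zero usage instead of throwing
  } else {
    array.length * 8L   // 8 bytes per long slot
  }
}
```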

Does this PR introduce any user-facing change?

No

How was this patch tested?

It was tested manually.

+4 -0

0 comment

1 changed file

pr created time in 3 months

create branch ayudovin/spark

branch : fix-npe-in-listener-2.4

created branch time in 3 months

pull request comment apache/spark

[SPARK-27558][CORE] Gracefully cleanup task when it fails with OOM exception

@gatorsmile, ok, no problem, I'll do it

ayudovin

comment created time in 3 months

started apache/spark

started time in 3 months

push event ayudovin/spark

Ryan Blue

commit sha 5ea134c3546aa0512a85cc2970d38f5e0345edde

[SPARK-28628][SQL] Implement SupportsNamespaces in V2SessionCatalog ## What changes were proposed in this pull request? This adds namespace support to V2SessionCatalog. ## How was this patch tested? WIP: will add tests for v2 session catalog namespace methods. Closes #25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>

view details

Shixiong Zhu

commit sha 89800931aa8b565335e45e1d26ff60402e46c534

[SPARK-3137][CORE] Replace the global TorrentBroadcast lock with fine grained KeyLock ### What changes were proposed in this pull request? This PR provides a new lock mechanism `KeyLock` to lock with a given key. Also use this new lock in `TorrentBroadcast` to avoid blocking tasks from fetching different broadcast values. ### Why are the changes needed? `TorrentBroadcast.readObject` uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #25612 from zsxwing/SPARK-3137. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
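
To convey the idea, a minimal sketch of a per-key lock; this is not Spark's actual KeyLock, which also cleans up entries and wakes waiters differently.

```scala
import java.util.concurrent.ConcurrentHashMap

class SimpleKeyLock[K] {
  private val monitors = new ConcurrentHashMap[K, AnyRef]()

  // Threads using the same key serialize on the same monitor; threads using
  // different keys (e.g. different broadcast ids) proceed independently.
  // Simplification: monitors are never removed, unlike the real implementation.
  def withLock[T](key: K)(body: => T): T = {
    val monitor = monitors.computeIfAbsent(key, _ => new Object)
    monitor.synchronized(body)
  }
}

// Usage sketch: new SimpleKeyLock[Long]().withLock(broadcastId) { readBlocks() }
```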

view details

Huaxin Gao

commit sha 56f2887dc8ab8d98e0c6a53db01ffb20cde19aa9

[SPARK-28788][DOC][SQL] Document ANALYZE TABLE statement in SQL Reference ### What changes were proposed in this pull request? Document ANALYZE TABLE statement in SQL Reference ### Why are the changes needed? To complete SQL reference ### Does this PR introduce any user-facing change? Yes ***Before***: There was no documentation for this. ***After***: ![image](https://user-images.githubusercontent.com/13592258/64046883-f8339480-cb21-11e9-85da-6617d5c96412.png) ![image](https://user-images.githubusercontent.com/13592258/64209526-9a6eb780-ce55-11e9-9004-53c5c5d24567.png) ![image](https://user-images.githubusercontent.com/13592258/64209542-a2c6f280-ce55-11e9-8624-e7349204ec8e.png) ### How was this patch tested? Tested using jykyll build --serve Closes #25524 from huaxingao/spark-28788. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>

view details

Thomas Graves

commit sha 4c8f114783241e780c0ad227bfb3191657637df7

[SPARK-27489][WEBUI] UI updates to show executor resource information ### What changes were proposed in this pull request? We are adding other resource type support to the executors and Spark. We should show the resource information for each executor on the UI Executors page. This also adds a toggle button to show the resources column. It is off by default. ![executorui1](https://user-images.githubusercontent.com/4563792/63891432-c815b580-c9aa-11e9-9f41-62975649efbc.png) ![Screenshot from 2019-08-28 14-56-26](https://user-images.githubusercontent.com/4563792/63891516-fd220800-c9aa-11e9-9fe4-89fcdca37306.png) ### Why are the changes needed? to show user what resources the executors have. Like Gpus, fpgas, etc ### Does this PR introduce any user-facing change? Yes introduces UI and rest api changes to show the resources ### How was this patch tested? Unit tests and manual UI tests on yarn and standalone modes. Closes #25613 from tgravescs/SPARK-27489-gpu-ui-latest. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>

view details

hongdd

commit sha a838699ee06eaa993c6db859d112b16f363f433d

[SPARK-28694][EXAMPLES] Add Java/Scala StructuredKerberizedKafkaWordCount examples ### What changes were proposed in this pull request? Add Java/Scala StructuredKerberizedKafkaWordCount examples to test kerberized kafka. ### Why are the changes needed? Now,`StructuredKafkaWordCount` example is not support to visit kafka using kerberos authentication. ### Does this PR introduce any user-facing change? No ### How was this patch tested? ``` Yarn client: $ bin/run-example --files ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab \ --driver-java-options "-Djava.security.auth.login.config=${path}/kafka_driver_jaas.conf" \ --conf \ "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \ --master yarn sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \ subscribe topic1,topic2 Yarn cluster: $ bin/run-example --files \ ${jaas_path}/kafka_jaas.conf,${keytab_path}/kafka.service.keytab,${krb5_path}/krb5.conf \ --driver-java-options \ "-Djava.security.auth.login.config=./kafka_jaas.conf \ -Djava.security.krb5.conf=./krb5.conf" \ --conf \ "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf" \ --master yarn --deploy-mode cluster \ sql.streaming.StructuredKerberizedKafkaWordCount broker1-host:port,broker2-host:port \ subscribe topic1,topic2 ``` Closes #25649 from hddong/Add-StructuredKerberizedKafkaWordCount-examples. Lead-authored-by: hongdd <jn_hdd@163.com> Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Sean Owen

commit sha df39855db826fd4bead85a2ca01eda15c101bbbe

[SPARK-28963][BUILD] Fall back to archive.apache.org in build/mvn for older releases ### What changes were proposed in this pull request? Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release. ### Why are the changes needed? If an older release's specified Maven doesn't exist in the mirrors, {{build/mvn}} will fail. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly. Closes #25667 from srowen/SPARK-28963. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

yangjie01

commit sha a07f795aead3bd81e7cccad30a7f6148c09ed8ad

[SPARK-28577][YARN] Resource capability requested for each executor add offHeapMemorySize ## What changes were proposed in this pull request? If MEMORY_OFFHEAP_ENABLED is true, add MEMORY_OFFHEAP_SIZE to resource requested for executor to ensure instance has enough memory to use. In this pr add a helper method `executorOffHeapMemorySizeAsMb` in `YarnSparkHadoopUtil`. ## How was this patch tested? Add 3 new test suite to test `YarnSparkHadoopUtil#executorOffHeapMemorySizeAsMb` Closes #25309 from LuciferYang/spark-28577. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Thomas Graves <tgraves@apache.org>
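
A hedged configuration sketch of the case this addresses; the sizes are illustrative only.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.memory.offHeap.enabled", "true")   // MEMORY_OFFHEAP_ENABLED
  .set("spark.memory.offHeap.size", "2g")        // MEMORY_OFFHEAP_SIZE

// With this change, the YARN container requested per executor accounts for the
// 2g of off-heap memory in addition to executor memory and memory overhead.
```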

view details

angerszhu

commit sha 9f478a6832eaad61bcf6e974845b968105e153f5

[SPARK-28901][SQL] SparkThriftServer's Cancel SQL Operation show it in JDBC Tab UI ### What changes were proposed in this pull request? Currently Spark Thrift Server does not support cancelling a SQL job. When we query through Spark Thrift Server from Hue, run a SQL statement and then click the cancel button, the cancellation does not take effect in the backend, and in the Spark JDBC UI tab the SQL's status stays COMPILED while its duration keeps increasing, which may confuse people. ![image](https://user-images.githubusercontent.com/46485123/63869830-60338f00-c9eb-11e9-8776-cee965adcb0a.png) ### Why are the changes needed? If the SQL status does not reflect the SQL's true status, it will confuse users. ### Does this PR introduce any user-facing change? Spark Thrift Server's UI tab will show the SQL's status as CANCELED when we cancel a SQL statement. ### How was this patch tested? Manually tested the UI tab status ![image](https://user-images.githubusercontent.com/46485123/63915010-80a12f00-ca67-11e9-9342-830dfa9c719f.png) ![image](https://user-images.githubusercontent.com/46485123/63915084-a9292900-ca67-11e9-8e26-375bf8ce0963.png) backend log ![image](https://user-images.githubusercontent.com/46485123/63914864-1092a900-ca67-11e9-93f2-08690ed9abf4.png) Closes #25611 from AngersZhuuuu/SPARK-28901. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 712874fa0937f0784f47740b127c3bab20da8569

[SPARK-28931][CORE][TESTS] Fix couple of bugs in FsHistoryProviderSuite ### What changes were proposed in this pull request? This patch fixes the bugs in test code itself, FsHistoryProviderSuite. 1. When creating log file via `newLogFile`, codec is ignored, leading to wrong file name. (No one tends to create test for test code, as well as the bug doesn't affect existing tests indeed, so not easy to catch.) 2. When writing events to log file via `writeFile`, metadata (in case of new format) gets written to file regardless of its codec, and the content is overwritten by another stream, hence no information for Spark version is available. It affects existing test, hence we have wrong expected value to workaround the bug. This patch also removes redundant parameter `isNewFormat` in `writeFile`, as according to review comment, Spark no longer supports old format. ### Why are the changes needed? Explained in above section why they're bugs, though they only reside in test-code. (Please note that the bug didn't come from non-test side of code.) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Modified existing UTs, as well as read event log file in console to see metadata is not overwritten by other contents. Closes #25629 from HeartSaVioR/MINOR-FIX-FsHistoryProviderSuite. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

view details

Jungtaek Lim (HeartSaVioR)

commit sha 594c9c5a3ece0e913949c7160bb4925e5d289e44

[SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer ## What changes were proposed in this pull request? This patch does pooling for both kafka consumers as well as fetched data. The overall benefits of the patch are following: * Both pools support eviction on idle objects, which will help closing invalid idle objects which topic or partition are no longer be assigned to any tasks. * It also enables applying different policies on pool, which helps optimization of pooling for each pool. * We concerned about multiple tasks pointing same topic partition as well as same group id, and existing code can't handle this hence excess seek and fetch could happen. This patch properly handles the case. * It also makes the code always safe to leverage cache, hence no need to maintain reuseCache parameter. Moreover, pooling kafka consumers is implemented based on Apache Commons Pool, which also gives couple of benefits: * We can get rid of synchronization of KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer. * We can extract the feature of object pool to outside of the class, so that the behaviors of the pool can be tested easily. * We can get various statistics for the object pool, and also be able to enable JMX for the pool. FetchedData instances are pooled by custom implementation of pool instead of leveraging Apache Commons Pool, because they have CacheKey as first key and "desired offset" as second key which "desired offset" is changing - I haven't found any general pool implementations supporting this. This patch brings additional dependency, Apache Commons Pool 2.6.0 into `spark-sql-kafka-0-10` module. ## How was this patch tested? Existing unit tests as well as new tests for object pool. Also did some experiment regarding proving concurrent access of consumers for same topic partition. * Made change on both sides (master and patch) to log when creating Kafka consumer or fetching records from Kafka is happening. * branches * master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging * patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging * Test query (doing self-join) * https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2 * Ran query from spark-shell, with using `local[*]` to maximize the chance to have concurrent access * Collected the count of fetch requests on Kafka via command: `grep "creating new Kafka consumer" logfile | wc -l` * Collected the count of creating Kafka consumers via command: `grep "fetching data from Kafka consumer" logfile | wc -l` Topic and data distribution is follow: ``` truck_speed_events_stream_spark_25151_v1:0:99440 truck_speed_events_stream_spark_25151_v1:1:99489 truck_speed_events_stream_spark_25151_v1:2:397759 truck_speed_events_stream_spark_25151_v1:3:198917 truck_speed_events_stream_spark_25151_v1:4:99484 truck_speed_events_stream_spark_25151_v1:5:497320 truck_speed_events_stream_spark_25151_v1:6:99430 truck_speed_events_stream_spark_25151_v1:7:397887 truck_speed_events_stream_spark_25151_v1:8:397813 truck_speed_events_stream_spark_25151_v1:9:0 ``` The experiment only used smallest 4 partitions (0, 1, 4, 6) from these partitions to finish the query earlier. The result of experiment is below: branch | create Kafka consumer | fetch request -- | -- | -- master | 1986 | 2837 patch | 8 | 1706 Closes #22138 from HeartSaVioR/SPARK-25151. 
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

view details

Dilip Biswal

commit sha b992160eae99b9d9c069f4db66ad8582cdf012f0

[SPARK-28811][DOCS][SQL] Document SHOW TBLPROPERTIES in SQL Reference ### What changes were proposed in this pull request? Document SHOW TBLPROPERTIES statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After.** ![image](https://user-images.githubusercontent.com/11567269/64281442-fdb92200-cf07-11e9-90ba-4699b6e93e23.png) ![Screen Shot 2019-09-04 at 11 32 11 AM](https://user-images.githubusercontent.com/11567269/64281484-188b9680-cf08-11e9-8e42-f130751ca495.png) ### How was this patch tested? Tested using jykyll build --serve Closes #25571 from dilipbiswal/ref-show-tblproperties. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>

view details

Dilip Biswal

commit sha f96486b4aaba36a0f843c8a52801c305b0fa2b16

[SPARK-28808][DOCS][SQL] Document SHOW FUNCTIONS in SQL Reference ### What changes were proposed in this pull request? Document SHOW FUNCTIONS statement in SQL Reference Guide. ### Why are the changes needed? Currently Spark lacks documentation on the supported SQL constructs causing confusion among users who sometimes have to look at the code to understand the usage. This is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. **Before:** There was no documentation for this. **After.** ![image](https://user-images.githubusercontent.com/11567269/64281840-e3cc0f00-cf08-11e9-9784-f01392276130.png) <img width="589" alt="Screen Shot 2019-09-04 at 11 41 44 AM" src="https://user-images.githubusercontent.com/11567269/64281911-0fe79000-cf09-11e9-955f-21b44590707c.png"> <img width="572" alt="Screen Shot 2019-09-04 at 11 41 54 AM" src="https://user-images.githubusercontent.com/11567269/64281916-12e28080-cf09-11e9-9187-688c2c751559.png"> ### How was this patch tested? Tested using jykyll build --serve Closes #25539 from dilipbiswal/ref-doc-show-functions. Lead-authored-by: Dilip Biswal <dbiswal@us.ibm.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>

view details

maryannxue

commit sha a7a3935c97d1fe6060cae42bbc9229c087b648ab

[SPARK-11150][SQL] Dynamic Partition Pruning ### What changes were proposed in this pull request? This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: 1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or 2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise 3. As a bypassed condition (`true`). ### Why are the changes needed? This is an important performance feature. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT - Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature. - Testing DPP with reused broadcast results. - Testing the key iterators on different HashedRelation types. - Testing the packing and unpacking of the broadcast keys in a LongType. Closes #25600 from maryannxue/dpp. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
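
To make the triggering pattern concrete, a hedged SQL-in-Scala sketch with made-up table names; whether the pruning filter is planned as a reused broadcast, a duplicated subquery, or bypassed follows the heuristics described above.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dpp-sketch").getOrCreate()

// Hypothetical tables: a fact table partitioned on the join key and a small dimension table.
spark.sql("CREATE TABLE fact (value INT, part_key INT) USING parquet PARTITIONED BY (part_key)")
spark.sql("CREATE TABLE dim (part_key INT, region STRING) USING parquet")

// The filter on dim can become a runtime filter on fact.part_key,
// so only the matching partitions of fact are scanned.
spark.sql(
  """SELECT f.value
    |FROM fact f JOIN dim d ON f.part_key = d.part_key
    |WHERE d.region = 'EU'
    |""".stripMargin).explain()
```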

view details

Sean Owen

commit sha ded23f83ddbb84d0871a6cf40b07eb0a09f46234

[SPARK-28921][K8S][FOLLOWUP] Also bump K8S client version in integration-tests ### What changes were proposed in this pull request? Per https://github.com/apache/spark/pull/25640#issuecomment-527397689 also bump K8S client version in integration-tests module. ### Why are the changes needed? Harmonize the version as intended. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests. Closes #25664 from srowen/SPARK-28921.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>

view details

Xianjin YE

commit sha ca711778683a16999560cbdd7c61d98ad6bde6db

[SPARK-28907][CORE] Review invalid usage of new Configuration()

### What changes were proposed in this pull request?
Replaces some incorrect usage of `new Configuration()`, as it loads the default configs defined in Hadoop.

### Why are the changes needed?
An unexpected config could be accessed instead of the expected config; see SPARK-28203 for an example.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #25616 from advancedxy/remove_invalid_configuration.

Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
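As a hedged Java illustration of the pitfall being cleaned up (not code from the PR; `fs.defaultFS` is just an example key): a bare `new Configuration()` only sees Hadoop's defaults and site files, while the configuration Spark has already assembled for the application (including `spark.hadoop.*` settings) lives on the SparkContext.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class HadoopConfExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("conf-example").getOrCreate();

    // Problematic: loads only Hadoop defaults and ignores anything Spark has configured.
    Configuration fresh = new Configuration();

    // Preferred: reuse the configuration Spark has already built for this application.
    Configuration fromSpark = spark.sparkContext().hadoopConfiguration();

    System.out.println("fresh      -> " + fresh.get("fs.defaultFS"));
    System.out.println("from Spark -> " + fromSpark.get("fs.defaultFS"));
  }
}
```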

view details

Ryan Blue

commit sha 5adaa2e1035e1c9e4911c2c812e9e956221a1ca4

[SPARK-28979][SQL] Rename UnresolvedTable to V1Table

### What changes were proposed in this pull request?
Rename `UnresolvedTable` to `V1Table` because it is not unresolved.

### Why are the changes needed?
The class name is inaccurate. This should be fixed before it is in a release.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #25683 from rdblue/SPARK-28979-rename-unresolved-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Huaxin Gao

commit sha e4f70023ad5bee3c3515c3651722746776ef0ab6

[SPARK-28830][DOC][SQL] Document UNCACHE TABLE statement in SQL Reference

### What changes were proposed in this pull request?
Document the UNCACHE TABLE statement in the SQL Reference.

### Why are the changes needed?
To complete the SQL Reference.

### Does this PR introduce any user-facing change?
Yes. After the change:
![image](https://user-images.githubusercontent.com/13592258/64299133-e04a7f00-cf2c-11e9-8f39-9b288e46c995.png)

### How was this patch tested?
Tested using jekyll build --serve.

Closes #25540 from huaxingao/spark-28830.

Lead-authored-by: Huaxin Gao <huaxing@us.ibm.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

view details

Shixiong Zhu

commit sha 84a4d3a17ccbf7e0cb75dffbbdc20a26715f7323

[SPARK-28976][CORE] Use KeyLock to simplify MapOutputTracker.getStatuses

### What changes were proposed in this pull request?
Use the `KeyLock` added in #25612 to simplify `MapOutputTracker.getStatuses`. It also brings some improvements after the refactoring:
- `InterruptedException` is no longer swallowed.
- When a shuffle block is fetched, we don't need to wake up unrelated sleeping threads.

### Why are the changes needed?
`MapOutputTracker.getStatuses` is pretty hard to maintain right now because it has a special lock mechanism that we need to pay attention to whenever updating this method. As we can use `KeyLock` to hide the complexity of locking behind a dedicated lock class, it's better to refactor it to make it easy to understand and maintain.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #25680 from zsxwing/getStatuses.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
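For readers unfamiliar with the idea, here is a minimal Java sketch of a per-key lock in the spirit of `KeyLock`. It is an illustration only, not Spark's actual implementation (which is a Scala utility class in Spark core).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal per-key lock: threads contending on the same key serialize, while threads
// working on different keys (e.g. different shuffle ids) proceed independently.
public class SimpleKeyLock<K> {
  private final ConcurrentHashMap<K, Object> locks = new ConcurrentHashMap<>();

  public <T> T withLock(K key, Supplier<T> body) {
    Object lock = locks.computeIfAbsent(key, k -> new Object());
    synchronized (lock) {
      return body.get();
    }
  }
}
```

A caller would wrap its fetch-or-reuse logic in something like `withLock(shuffleId, () -> ...)`, which removes the need for hand-rolled wait/notify bookkeeping.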

view details

Burak Yavuz

commit sha b9edd44bd6a3326f6a86431674ff3d65e0ed9b90

[SPARK-28964] Add the provider information to the table properties in saveAsTable

### What changes were proposed in this pull request?
Adds the provider information to the table properties in saveAsTable.

### Why are the changes needed?
Otherwise, catalog implementations don't know what kind of Table definition to create.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing unit tests check the existence of the provider now.

Closes #25669 from brkyvz/provider.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
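A hedged Java example of the call path this affects; the table name `demo_table` is made up, and the point is only that the format chosen by the caller ("parquet" here) is the provider recorded with the table metadata.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SaveAsTableExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("save-as-table-example")
        .getOrCreate();

    Dataset<Row> df = spark.range(10).toDF("id");

    // The format passed here is the "provider" of the resulting table.
    df.write().format("parquet").mode(SaveMode.Overwrite).saveAsTable("demo_table");

    // DESCRIBE TABLE EXTENDED surfaces the provider in the table metadata.
    spark.sql("DESCRIBE TABLE EXTENDED demo_table").show(100, false);
  }
}
```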

view details

Ryan Blue

commit sha dde393142fd52aa3c6021d108c835bcfab564951

[SPARK-28878][SQL] Remove extra project for DSv2 reads with columnar batches

### What changes were proposed in this pull request?
Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution.

### Why are the changes needed?
Removes an extra projection and copy.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

push time in 4 months

delete branch ayudovin/spark

delete branch : default-core-yarn

delete time in 4 months

PR opened apache/spark

[SPARK-27558] - Checking if array is null


What changes were proposed in this pull request?

An NPE is thrown when the internal array is null in UnsafeInMemorySorter.

Why are the changes needed?

These changes check whether the array is null, which helps avoid the NPE.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

It was tested manually.
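The fix itself is just a defensive null check (+4 lines). A hedged sketch of the idea, using simplified stand-in names rather than the actual UnsafeInMemorySorter fields:

```java
// Illustrative only: UnsafeInMemorySorter keeps a sort array that may already have been
// freed (set to null); touching it without a guard produces the NPE described above.
public class SorterSketch {
  private long[] array; // stand-in for the sorter's internal array

  public long getMemoryUsage() {
    if (array == null) {
      return 0L; // array already freed: report no usage instead of throwing an NPE
    }
    return array.length * 8L;
  }
}
```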

+4 -0

0 comment

1 changed file

pr created time in 4 months

create branch ayudovin/spark

branch : fix-npe-in-listener

created branch time in 4 months

issue closed ClickHouse/clickhouse-jdbc

ClickHouseRowBinaryInputStream read FixedString(n)

I have a table with a FixedString(10) column. When I try to read this column from the table using ClickHouseRowBinaryInputStream.readString(), I get an exception.

For example, my value is "0101712189" and I get:

Caused by: java.io.EOFException
    at com.google.common.io.LittleEndianDataInputStream.readUnsignedByte(LittleEndianDataInputStream.java:83)
    at com.google.common.io.LittleEndianDataInputStream.readByte(LittleEndianDataInputStream.java:206)
    at ru.yandex.clickhouse.util.ClickHouseRowBinaryInputStream.readUnsignedLeb128(ClickHouseRowBinaryInputStream.java:41)
    at ru.yandex.clickhouse.util.ClickHouseRowBinaryInputStream.readString(ClickHouseRowBinaryInputStream.java:80)
    at com.badoo.Example.lambda$static$8f1f73c9$1(Example.java:35)
    at com.badoo.rdd.ClickHouseBinaryRdd$1.read(ClickHouseBinaryRdd.java:91)
    at com.badoo.rdd.ClickHouseBinaryIterator.next(ClickHouseBinaryIterator.java:46)

closed time in 4 months

ayudovin

issue comment ClickHouse/clickhouse-jdbc

ClickHouseRowBinaryInputStream read FixedString(n)

Closing, as the pull request has been merged.

ayudovin

comment created time in 4 months

pull request comment ClickHouse/clickhouse-jdbc

read FixedString from BinaryInputStream

Is insert with FixedString already supported?

The general string write should work for you.

ayudovin

comment created time in 4 months

pull request comment ClickHouse/clickhouse-jdbc

read FixedString from BinaryInputStream

@alex-krash, I have added the changes based on your suggestions.

ayudovin

comment created time in 5 months

push event ayudovin/clickhouse-jdbc

yudovin

commit sha 0569ac40b7488ddd59d80c87021354c12d0df73e

adding test for zero-padded string

view details

push time in 5 months

PR opened ClickHouse/clickhouse-jdbc

read FixedString from BinaryInputStream

ClickHouseRowBinaryInputStream can't read FixedString(n) and throws an exception in this case, so functionality for reading FixedString(n) has been added.
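Conceptually, a FixedString(n) value is not length-prefixed the way a regular String is (hence the failed LEB128 read in the stack trace above), so the reader only needs to consume exactly n bytes. A hedged sketch of that idea follows; the class and method names are illustrative and may not match the ones added in this PR.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FixedStringReaderSketch {
  // Reads exactly 'length' bytes (FixedString(n) semantics) and decodes them as UTF-8.
  // Values shorter than n arrive zero-padded from the server.
  public static String readFixedString(DataInputStream in, int length) throws IOException {
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```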

+15 -0

0 comment

2 changed files

pr created time in 5 months

create branch ayudovin/clickhouse-jdbc

branch : read-fixedString-from-binaryInputStream

created branch time in 5 months

issue opened ClickHouse/clickhouse-jdbc

ClickHouseRowBinaryInputStream.readString()

I have a table with a FixedString(10) column. When I try to read this column from the table using ClickHouseRowBinaryInputStream.readString(), I get an exception.

For example, my value is "0101712189" and I get:

Caused by: java.io.EOFException
    at com.google.common.io.LittleEndianDataInputStream.readUnsignedByte(LittleEndianDataInputStream.java:83)
    at com.google.common.io.LittleEndianDataInputStream.readByte(LittleEndianDataInputStream.java:206)
    at ru.yandex.clickhouse.util.ClickHouseRowBinaryInputStream.readUnsignedLeb128(ClickHouseRowBinaryInputStream.java:41)
    at ru.yandex.clickhouse.util.ClickHouseRowBinaryInputStream.readString(ClickHouseRowBinaryInputStream.java:80)
    at com.badoo.Example.lambda$static$8f1f73c9$1(Example.java:35)
    at com.badoo.rdd.ClickHouseBinaryRdd$1.read(ClickHouseBinaryRdd.java:91)
    at com.badoo.rdd.ClickHouseBinaryIterator.next(ClickHouseBinaryIterator.java:46)

created time in 5 months

push event Altoros/Query-Comparison

ayudovin

commit sha e7f458872968c367a1e99448c85b658d37d559a1

adding gitignore

view details

ayudovin

commit sha d5153c8dfc188db9a1cf935730375d31c50306b1

adding Couchbase setup

view details

push time in 5 months
