
dominictarr/curry 313

simple curry module, with nothing *too clever*, and full test coverage

hughfdjackson/immutable 98

neatly packages immutable equivalents to JavaScript's Objects and Arrays.

hughfdjackson/delve 25

Getting properties from values without the error-ridden headache.

hughfdjackson/persistent-hash-trie 17

Pure string:val storage, using structural sharing

hughfdjackson/curried 12

Awesome curried standard library.

hughfdjackson/tap 12

Tap your functions into method chains.

hughfdjackson/flywheel 6

Animation loop utility lib.

hughfdjackson/calculator 4

A Simple Haskell Calculator, using Parsec

hughfdjackson/lmbd 4

Sugar for your tiniest of functions.

create branch hughfdjackson/fluorescent-fibre-counting

branch : less-variation-for-arwj

created branch time in 2 months

issue comment apache/hudi

[SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

@bvaradar - thanks for your help. I think we're going to try the above approach, but it's something we might return to later.

Closing the issue for now sounds like a good idea.

hughfdjackson

comment created time in 2 months

issue comment apache/hudi

[SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

Hi @bvaradar -

In general getting incremental read to discard duplicates is not possible for MOR table types as we defer the merging of records to compaction.

That's interesting - as your comment suggests, I've only looked at CoW tables in any depth. I look forward to delving into MoR's design in a bit more detail so I can get my head around what the implications of such a feature would be there + understand your comment better.

I was thinking about alternate ways to achieve your use-case for COW table by using an application level boolean flag. Let me know if this makes sense:

  1. Introduce additional boolean column "changed". Default Value is false.
  2. Have your own implementation of HoodieRecordPayload plugged-in.
  3. a. In HoodieRecordPayload.getInsertValue(), return an avro record with changed = true. This function is called first time when the new record is inserted.
     b. In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine, there is no material change, set changed = false else set it to true.
  4. In your incremental query, add the filter changed = true to filter out those without material changes?

That does make sense, although I think a boolean column may lead to missing changes if the incremental read spans two or more commits to the same row. I'm spiking a variation on that suggestion with my team, wherein:

  1. Introduce a 'last_updated_timestamp', default to null (i.e. the update was in this commit)
  2. Have your own implementation of HoodieRecordPayload plugged-in.
  3. a. In HoodieRecordPayload.getInsertValue(), return an avro record with last_updated_timestamp = null.*
     b. In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine there is no material change, set last_updated_timestamp to that of the old record (if it exists) or to the old record's commit_time.

In the incremental query, we're filtering for null (which indicates that one of the commits within the timeline last updated the record) or for last_updated_timestamp within the beginInstant and endInstant bounds.

We've not tested it extensively, but it looks like a promising workaround so far.
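The steps above can be modelled roughly as follows. This is a minimal sketch in plain Scala, not code written against the real HoodieRecordPayload interface: StoredRow, LastUpdatedSketch and all of its method names are hypothetical, None stands in for null, and "material change" is reduced to a comparison of a single value field. The filter assumes an exclusive beginInstant and an inclusive endInstant, and that commit times are fixed-width strings that compare correctly as text.

```scala
// Hypothetical stand-in for a stored row; this is not a Hudi type.
final case class StoredRow(
    key: String,
    value: String,
    commitTime: String,
    lastUpdatedTimestamp: Option[String]) // None plays the role of null

object LastUpdatedSketch {
  // Step 3a: a fresh insert carries no timestamp ("updated in this commit").
  def onInsert(key: String, value: String, commitTime: String): StoredRow =
    StoredRow(key, value, commitTime, None)

  // Step 3b: with no material change, carry forward the old record's timestamp,
  // falling back to the old record's commit time. Otherwise reset it to None.
  def onUpsert(old: StoredRow, incomingValue: String, commitTime: String): StoredRow =
    if (incomingValue == old.value)
      StoredRow(old.key, old.value, commitTime,
        Some(old.lastUpdatedTimestamp.getOrElse(old.commitTime)))
    else
      StoredRow(old.key, incomingValue, commitTime, None)

  // Incremental filter: keep rows whose timestamp is None, or falls within
  // the (beginInstant, endInstant] window (an assumed convention).
  def changedWithin(row: StoredRow, beginInstant: String, endInstant: String): Boolean =
    row.lastUpdatedTimestamp.forall { ts =>
      ts.compareTo(beginInstant) > 0 && ts.compareTo(endInstant) <= 0
    }
}
```

Under these assumptions, a row that is re-upserted unchanged keeps pointing at the commit that last altered it, so a read covering only the later commit filters it out, while a read spanning both commits still returns it.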


* It'd be 'cleaner' to set this equal to the commit time of the write, but in our HoodieRecordPayload class, that's not available unfortunately. The 'null means insert' + special case handling in HoodieRecordPayload.combineAndGetUpdateValue() is a work-around for that.

hughfdjackson

comment created time in 2 months

issue comment apache/hudi

[SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

@bvaradar - As a follow-up question, your reply confirms that what we're looking for (ideally) isn't a Hudi feature currently. Is it something you might be interested in supporting?

In many use cases, the behaviour would likely be identical to the current - for snapshot queries, or for incrementally reading tables where the writer ensures only material changes* are written (e.g. some stream processing, or insert-only batch processes). In the remaining use-cases like ours, it would cut back on a lot of noise + processing.

If so, I can talk to my team about contributing towards the project, since it would be valuable to us.


* I'm using 'material changes' here to describe an upsert that impacts on the non-_hoodie columns. Either a deletion, or a change in value to one of those columns.
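As an illustration of that definition, the check could look something like the sketch below. It is plain Scala over Map-based rows rather than Avro records or DataFrames, and MaterialChange and isMaterial are hypothetical names. It assumes every metadata column starts with the prefix _hoodie, and it treats an absent row as None, so that a deletion (and, as a simplification, a first insert) counts as material.

```scala
object MaterialChange {
  // Drop the metadata columns (assumed to share the "_hoodie" prefix)
  // so that only the payload columns are compared.
  private def payload(row: Map[String, Any]): Map[String, Any] =
    row.filter { case (name, _) => !name.startsWith("_hoodie") }

  // before/after are None when the row does not exist on that side,
  // so a deletion (Some -> None) is reported as a material change.
  def isMaterial(
      before: Option[Map[String, Any]],
      after: Option[Map[String, Any]]): Boolean =
    before.map(payload) != after.map(payload)
}
```

With this shape, two versions of a row that differ only in _hoodie_commit_time compare as unchanged, while a different fare value or a missing row compares as changed.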

hughfdjackson

comment created time in 2 months

issue comment apache/hudi

[SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

Hi @bvaradar - thanks for the reply! And for the suggestion.

In our use case, we're interested in both incremental read of material changes, and in using the Hudi table with regular snapshot queries. I would expect 30-50% incremental reads, and 50-70% snapshot queries.

If I'm understanding correctly, your suggestion would essentially lead to an event log of all material changes to an entity. If you do a snapshot query against that data, you'd end up with lots of duplicates, so each query would need to include de-duplication to reproduce a materialised view with the latest data for each entity.

Is that right?
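To make the de-duplication concern concrete, here is a rough sketch of what each snapshot-style query over such an event log would have to do. It uses in-memory Scala collections rather than Spark, and ChangeEvent, SnapshotFromLog and latestPerEntity are hypothetical names. It assumes commit times are fixed-width strings, so the lexicographically largest one is the latest.

```scala
// Hypothetical event-log entry: one row per material change to an entity.
final case class ChangeEvent(entityId: String, commitTime: String, value: String)

object SnapshotFromLog {
  // Rebuild the "latest state per entity" view by keeping, for each
  // entity, only the event with the greatest commit time.
  def latestPerEntity(log: Seq[ChangeEvent]): Map[String, ChangeEvent] =
    log.groupBy(_.entityId).map { case (id, events) =>
      id -> events.maxBy(_.commitTime)
    }
}
```

Every reader wanting the materialised view would need to repeat a grouping step like this one, which is the per-query cost the question above is asking about.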

hughfdjackson

comment created time in 2 months

issue opened apache/hudi

[SUPPORT]: Incremental read returns all upserted rows, even if no material change has occurred.

Describe the problem you faced

My team are interested in writing to Hudi tables using a repeated batch process that often upserts data that's identical to what's already there. For instance, we may be:

  • recalculating the number of times a particular set of events has occurred
  • re-running a query over the last week of data, to include potentially late arriving data.

We also have some consumers that want to consume these tables incrementally (to ingest the latest results into local databases, or monitor the changes). Ideally, these consumers would only see the 1% of records that have changed, rather than all records involved in the upsert.

However, in our testing, it seems like the incremental query returns all records that were involved in the upsert, even if they were overwriting identical data.

(As far as I can tell, this happens here: https://github.com/apache/hudi/blob/release-0.5.3/hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java#L238-L244, no matter which PAYLOAD_CLASS_OPT_KEY class is used).

To Reproduce

Steps to reproduce the behavior:

  1. clone hudi git repo, checkout release-0.5.3-rc2 and run mvn clean package -DskipTests -DskipITs
  2. Copy packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.5.3-rc2.jar to EMR master node
  3. Run the following spark shell on master, with the command: spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars hudi-spark-bundle_2.11-0.5.3-rc2.jar,/usr/lib/spark/external/lib/spark-avro.jar -i spark-shell-script

where the contents of spark-shell-script are:

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SaveMode
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.DataFrame
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.table.HoodieTable
import org.apache.hudi.config.HoodieWriteConfig
  
// Helper functions
val basePath = "s3://{s3BucketNameAndPrefixPath}"
val tableName = "hudi_incremental_read_test"
def write(df: DataFrame, saveMode: SaveMode = Append) = df.write.format("hudi")
    .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
    .option(RECORDKEY_FIELD_OPT_KEY, "uuid")
    .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
    .option("hoodie.consistency.check.enabled", "true")
    .option(TABLE_NAME, tableName)
    .mode(saveMode)
    .save(basePath)
def incrementalRead(beginInstant: String) = { 
    println(s"READING FROM $beginInstant")   
    spark.read
     .format("hudi")
     .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL)
     .option(BEGIN_INSTANTTIME_OPT_KEY, beginInstant)
     .load(basePath)
} 
def latestCommitInstant() = { 
  val metaClient = new HoodieTableMetaClient(spark.sparkContext.hadoopConfiguration, basePath, true)
  val hoodieTable = HoodieTable.getHoodieTable(metaClient, HoodieWriteConfig.newBuilder().withPath(basePath).build(), spark.sparkContext)
  
  hoodieTable.getMetaClient.getCommitTimeline.filterCompletedInstants().lastInstant.get.getTimestamp
}

def justBefore(commitTime: String) = (commitTime.toLong - 1).toString
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

write(df, saveMode=Overwrite)

println("""
----------- INCREMENTAL READ -------
""")
println("The whole table is new, so I'm expecting all 10 rows to be returned on incremental read")
incrementalRead(justBefore(latestCommitInstant)).show()

// generate an update for a single row
val updates = convertToStringList(dataGen.generateUpdates(1))
val updatesDF = spark.read.json(spark.sparkContext.parallelize(updates, 2))

println("""
----------- INCREMENTAL READ -------
""")
println("Now we're updating a row, we expect to see the updated row only on incremental read, which we do")
write(updatesDF)
incrementalRead(justBefore(latestCommitInstant)).show()

println("""
----------- INCREMENTAL READ -------
""")
println("Re-upserting the same row twice causes it to be 'emitted' twice to the incremental reader, even though the contents of the second reading are identical from the first (metadata aside)")
write(updatesDF)
incrementalRead(justBefore(latestCommitInstant)).show()

That results in:

----------- INCREMENTAL READ -------
The whole table is new, so I'm expecting all 10 rows to be returned on incremental read
READING FROM 20200818091617
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|       partitionpath|    rider| ts|                uuid|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
|     20200818091618|  20200818091618_1_1|ecde6618-0cbc-4b6...|  americas/united_s...|3e9b3e64-3895-46a...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|americas/united_s...|rider-213|0.0|ecde6618-0cbc-4b6...|
|     20200818091618|  20200818091618_1_2|c9a45eda-fe53-480...|  americas/united_s...|3e9b3e64-3895-46a...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|americas/united_s...|rider-213|0.0|c9a45eda-fe53-480...|
|     20200818091618|  20200818091618_1_3|35808b31-2d1e-474...|  americas/united_s...|3e9b3e64-3895-46a...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|americas/united_s...|rider-213|0.0|35808b31-2d1e-474...|
|     20200818091618|  20200818091618_1_4|67e1c9d5-a3c0-4f7...|  americas/united_s...|3e9b3e64-3895-46a...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|americas/united_s...|rider-213|0.0|67e1c9d5-a3c0-4f7...|
|     20200818091618|  20200818091618_1_5|8fdf91c8-b0ca-46c...|  americas/united_s...|3e9b3e64-3895-46a...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|americas/united_s...|rider-213|0.0|8fdf91c8-b0ca-46c...|
|     20200818091618|  20200818091618_0_1|2efbfbf1-aa1f-40f...|  americas/brazil/s...|a71d09b8-7cc8-408...| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|americas/brazil/s...|rider-213|0.0|2efbfbf1-aa1f-40f...|
|     20200818091618|  20200818091618_0_2|2bbebad3-1a3c-4f1...|  americas/brazil/s...|a71d09b8-7cc8-408...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|americas/brazil/s...|rider-213|0.0|2bbebad3-1a3c-4f1...|
|     20200818091618|  20200818091618_0_3|2c3d179c-899f-42f...|  americas/brazil/s...|a71d09b8-7cc8-408...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|americas/brazil/s...|rider-213|0.0|2c3d179c-899f-42f...|
|     20200818091618|  20200818091618_2_1|3c9add87-8347-41d...|    asia/india/chennai|df2d7f47-0d10-43b...|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|  asia/india/chennai|rider-213|0.0|3c9add87-8347-41d...|
|     20200818091618|  20200818091618_2_2|8cd8ff41-791e-43a...|    asia/india/chennai|df2d7f47-0d10-43b...|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|  asia/india/chennai|rider-213|0.0|8cd8ff41-791e-43a...|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+--------------------+---------+---+--------------------+
----------- INCREMENTAL READ -------
Now we're updating a row, we expect to see the updated row only on incremental read, which we do
20/08/18 09:17:36 WARN IncrementalTimelineSyncFileSystemView: Incremental Sync of timeline is turned off or deemed unsafe. Will revert to full syncing
READING FROM 20200818091705
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|         begin_lat|         begin_lon|    driver|           end_lat|           end_lon|              fare|       partitionpath|    rider| ts|                uuid|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+
|     20200818091706|  20200818091706_0_3|35808b31-2d1e-474...|  americas/united_s...|3e9b3e64-3895-46a...|0.7340133901254792|0.5142184937933181|driver-284|0.7814655558162802|0.6592596683641996|49.527694252432056|americas/united_s...|rider-284|0.0|35808b31-2d1e-474...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+
----------- INCREMENTAL READ -------
Re-upserting the same row twice causes it to be 'emitted' twice to the incremental reader, even though the contents of the second reading are identical from the first (metadata aside)
20/08/18 09:18:04 WARN IncrementalTimelineSyncFileSystemView: Incremental Sync of timeline is turned off or deemed unsafe. Will revert to full syncing
READING FROM 20200818091736
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|         begin_lat|         begin_lon|    driver|           end_lat|           end_lon|              fare|       partitionpath|    rider| ts|                uuid|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+
|     20200818091737|  20200818091737_0_4|35808b31-2d1e-474...|  americas/united_s...|3e9b3e64-3895-46a...|0.7340133901254792|0.5142184937933181|driver-284|0.7814655558162802|0.6592596683641996|49.527694252432056|americas/united_s...|rider-284|0.0|35808b31-2d1e-474...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+------------------+----------+------------------+------------------+------------------+--------------------+---------+---+--------------------+

Expected behavior

Ideally (in our use case), upserting a row whose contents are identical to what is already stored wouldn't cause an incremental reader to read the data again.

Environment Description

  • Hudi version : 0.5.3-rc2, built from source

  • Spark version : 2.4.4 (Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252)

  • Hive version : 2.3.6

  • Hadoop version : 2.8.5-amzn-5

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

  • EMR Version : emr-5.29.0

created time in 2 months

started liljencrantz/crush

started time in 3 months
