Trang Nguyen (nguyentr17) · Tamr · Cambridge, Boston · https://www.linkedin.com/in/trangnguyen17/ · "Passion for anything related to data!"

dennisliu19/MAPS17 0

This repository contains source code (R) of data visualization apps for the tangram games as well as the lab series of avoiding bad statistical practice.

nguyentr17/crossfilter 0

Fast n-dimensional filtering and grouping of records.

nguyentr17/Insults 0

Detect whether a social media comment is insulting or derogatory

nguyentr17/jirasurvivor 0

A bug leaderboard for JIRA issues

nguyentr17/kaggle-freesound-audio-tagging 0

Kaggle - Freesound General-Purpose Audio Tagging Challenge - Solution code

Pull request review comment Datatamer/terraform-aws-tamr-config

add TAMR_HBASE_EXTRA_URIS

 TAMR_CONNECTION_INFO_TYPE: hbase-site
 TAMR_HBASE_NAMESPACE: ${hbase_namespace}
 TAMR_HBASE_COMPRESSION: snappy
 TAMR_HBASE_CONFIG_URIS: s3://${tamr_data_bucket}/${hbase_config_path}hbase-site.xml
+TAMR_HBASE_EXTRA_URIS: s3://${tamr_data_bucket}/${hbase_config_path}hbase-env.sh

That said, I think this would be a benign change if we did choose to add it.

Someone just needs to know why it's necessary.

keziah-tamr

comment created time in 2 days

Pull request review comment Datatamer/terraform-aws-tamr-config

add TAMR_HBASE_EXTRA_URIS

 TAMR_CONNECTION_INFO_TYPE: hbase-site
 TAMR_HBASE_NAMESPACE: ${hbase_namespace}
 TAMR_HBASE_COMPRESSION: snappy
 TAMR_HBASE_CONFIG_URIS: s3://${tamr_data_bucket}/${hbase_config_path}hbase-site.xml
+TAMR_HBASE_EXTRA_URIS: s3://${tamr_data_bucket}/${hbase_config_path}hbase-env.sh

These are the only uncommented lines in the hbase-env.sh:

export HBASE_MANAGES_ZK=false
export HBASE_ROOT_LOGGER=INFO,DRFA
export HBASE_SECURITY_LOGGER=INFO,DRFAS
export HBASE_REGIONSERVER_OPTS=-Xmx12288m

They all seem to be server facing to me.

What's the justification as to why the client would need these settings too?

Automated tests do not set this property, and @daniosim 's recent internal deployments have not needed this either, so I'm not 💯 convinced that this is necessary.

keziah-tamr

comment created time in 2 days

pull request comment Datatamer/terraform-aws-tamr-config

add TAMR_HBASE_EXTRA_URIS

This PR doesn't appear to be linked to a DevOps/SRE jira ticket

keziah-tamr

comment created time in 2 days

push event Datatamer/terraform-aws-tamr-config

Keziah Katz

commit sha c7d416cf436e4fe3613c7072e712d3453713cd11

verions increment for TAMR_HBASE_EXTRA_URIS

view details

push time in 2 days

PR opened Datatamer/terraform-aws-tamr-config

add TAMR_HBASE_EXTRA_URIS

I was unable to add data to my AWS scale-out Tamr instance without setting TAMR_HBASE_EXTRA_URIS. It should be included in this module. (relevant Slack conversation: https://tamr.slack.com/archives/CT6SG4L2X/p1623346716073900)

+1 -0

0 comment

1 changed file

pr created time in 2 days

create branch Datatamer/terraform-aws-tamr-config

branch : add-hbase-extra-uris

created branch time in 2 days

pull request comment Datatamer/terraform-aws-emr

Add support for instance fleets

This PR doesn't appear to be linked to a DevOps/SRE jira ticket

msoni-tamr

comment created time in 3 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 output "release_label" {
   description = "The release label for the Amazon EMR release."
 }
-output "core_group_instance_count" {
-  value       = var.core_group_instance_count
-  description = "Number of cores configured to execute the job flow"
+output "master_instance_on_demand_count" {
+  value       = var.master_instance_on_demand_count
+  description = "Number of on-demand master instances configured to execute the job flow"
+}
+
+output "master_instance_spot_count" {
+  value       = var.master_instance_spot_count
+  description = "Number of spot master instances configured to execute the job flow"

Nit: I think we convey the same information without "configured to execute the job flow".

  description = "Number of spot master instances"
msoni-tamr

comment created time in 3 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 resource "aws_emr_cluster" "emr-cluster" {
     key_name                          = var.key_pair_name
   }
-
-  master_instance_group {
-    name          = var.master_instance_group_name
-    instance_type = var.master_instance_type
-    # NOTE: value must be 1 or 3
-    instance_count = var.master_group_instance_count
-    # Spot Instance definition
-    bid_price = var.master_bid_price
-    ebs_config {
-      size                 = var.master_ebs_size
-      type                 = var.master_ebs_type
-      volumes_per_instance = var.master_ebs_volumes_count
+  master_instance_fleet {

As of now, the test plan specifies testing spot instances for core nodes only. We may well try out spot instances for master nodes as well as testing progresses.

msoni-tamr

comment created time in 3 days

push event Datatamer/terraform-azure-hdinsight-hbase

Ben Schwartzman

commit sha 77555081421f8943abb230bd0e43b4f16a12ec46

Use map for defining schedule

view details

push time in 3 days

Pull request review comment Datatamer/terraform-azure-hdinsight-hbase

DEV-14877: Support schedule-based scaling

 resource "azurerm_hdinsight_hbase_cluster" "hdinsight_hbase_cluster" {
       subnet_id             = var.subnet_id
       virtual_network_id    = var.vnet_id
       target_instance_count = var.worker_count
+
+      # Schedule based auto-scaling
+      autoscale {
+        recurrence {
+          timezone = var.scaling_timezone
+          dynamic "schedule" {
+            for_each = var.scaling_times
+            content {
+              days                  = var.scaling_days
+              time                  = schedule.value
+              target_instance_count = var.scaled_target_instance_counts[schedule.key]

scaled_target_instance_counts and scaling_times are indirectly coupled.

Maintaining the two lists obviously works, but would it make sense to combine these two things into a map, since they're coupled? It may decrease the chance of someone not keeping the two lists in sync.

Something like:

{"13:00": 3, "14:00": 4}

You should be able to reference both the key and value in the dynamic block:

            content {
              days                  = var.scaling_days
              time                  = schedule.key
              target_instance_count = schedule.value

Maybe something wild like this, where we inject the days into that time too:

{"13:00": {["Monday","Tuesday"]: 3}, "14:00": {["Monday","Friday"]: 4}}

Then you could do something like this:

            content {
              days                  = var.scaling_times[schedule.key].key
              time                  = schedule.key
              target_instance_count = var.scaling_times[schedule.key].value

Obviously haven't thought this through and I don't 💯 remember the constraints, but maybe there's something there.
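
To make the combined-map idea concrete, here's a hypothetical sketch (variable name, attribute names, and defaults are all invented for illustration) of a single map of objects driving the dynamic block, so the times, days, and target counts can't drift out of sync:

```hcl
# Hypothetical combined variable: one entry per scaling event, keyed by time.
variable "scaling_schedule" {
  type = map(object({
    days  = list(string)
    count = number
  }))
  default = {
    "13:00" = { days = ["Monday", "Tuesday"], count = 3 }
    "14:00" = { days = ["Monday", "Friday"], count = 4 }
  }
}

# Inside the recurrence block, the dynamic block would then read:
#   dynamic "schedule" {
#     for_each = var.scaling_schedule
#     content {
#       days                  = schedule.value.days
#       time                  = schedule.key
#       target_instance_count = schedule.value.count
#     }
#   }
```

This keeps each scaling event self-contained in one map entry rather than spread across parallel lists.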

schwartzmanb

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 output "release_label" {
   description = "The release label for the Amazon EMR release."
 }
-output "core_group_instance_count" {
-  value       = var.core_group_instance_count

Good that you added these new outputs.

Could we keep the core_group_instance_count output?

Maybe we could do something like:

value = var.core_instance_on_demand_count +  var.core_instance_spot_count

The total is likely going to be the interesting value here for other modules (if I had to guess at how this value is used).
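
A minimal sketch of keeping that output, assuming the new on-demand/spot count variables from this PR (the description wording is invented):

```hcl
# Preserve the old output name as the total of the new fleet counts,
# so downstream consumers of core_group_instance_count keep working.
output "core_group_instance_count" {
  value       = var.core_instance_on_demand_count + var.core_instance_spot_count
  description = "Total number of core instances (on-demand + spot)"
}
```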

msoni-tamr

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 module "emr-cluster" {
   bootstrap_actions              = var.bootstrap_actions
 
   # Cluster instances
-  subnet_id                   = var.subnet_id
-  key_pair_name               = var.key_pair_name
-  master_instance_group_name  = var.master_instance_group_name
-  master_instance_type        = var.master_instance_type
-  master_group_instance_count = var.master_group_instance_count
-  master_ebs_volumes_count    = var.master_ebs_volumes_count
-  master_ebs_type             = var.master_ebs_type
-  master_ebs_size             = var.master_ebs_size
-  core_instance_group_name    = var.core_instance_group_name
-  core_instance_type          = var.core_instance_type
-  core_group_instance_count   = var.core_group_instance_count
-  core_ebs_volumes_count      = var.core_ebs_volumes_count
-  core_ebs_type               = var.core_ebs_type
-  core_ebs_size               = var.core_ebs_size
-  custom_ami_id               = var.custom_ami_id
-  core_bid_price              = var.core_bid_price
-  master_bid_price            = var.master_bid_price
+  subnet_id                       = var.subnet_id
+  key_pair_name                   = var.key_pair_name
+  master_instance_group_name      = var.master_instance_group_name
+  master_instance_type            = var.master_instance_type
+  master_instance_on_demand_count = var.master_instance_on_demand_count
+  master_ebs_volumes_count        = var.master_ebs_volumes_count
+  master_ebs_type                 = var.master_ebs_type
+  master_ebs_size                 = var.master_ebs_size
+  core_instance_group_name        = var.core_instance_group_name
+  core_instance_type              = var.core_instance_type
+  core_instance_on_demand_count   = var.core_instance_on_demand_count
+  core_ebs_volumes_count          = var.core_ebs_volumes_count
+  core_ebs_type                   = var.core_ebs_type
+  core_ebs_size                   = var.core_ebs_size
+  custom_ami_id                   = var.custom_ami_id
+  core_bid_price                  = var.core_bid_price
+  master_bid_price                = var.master_bid_price

Most (if not all) the new configs should probably be here too, otherwise to get access to them you'll have to use the nested module directly.

Maybe there's a case for excluding a few and being more opinionated at this level of the module. Not the current pattern, but I'm open to the discussion.

msoni-tamr

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 # TAMR AWS EMR module
 
+## v4.0.0 - June 15, 2021
+* Update cluster to use instance fleet, for a mix of on-demand and spot instances

Be specific about which variables have been removed or added.

I know that we're adding a lot of variables, so maybe we don't have to be exhaustive, but we should cover the ones that someone will need to update on upgrade: all removed variables and what they were replaced with (e.g. core_group_instance_count -> core_instance_on_demand_count).

This is helpful for someone upgrading from the previous version to know what needs to change.

msoni-tamr

comment created time in 4 days

pull request comment Datatamer/terraform-aws-emr

Add support for instance fleets

Updated to make most values configurable. Updated version and changelog.

msoni-tamr

comment created time in 4 days

push event Datatamer/terraform-azure-hdinsight-hbase

Ben Schwartzman

commit sha bc908f7226e7aafd0b2b84e0a517710a1af8d08d

Ignore schedule changes

view details

push time in 4 days

push event Datatamer/tamr-toolbox

skalish

commit sha 8df457a0b344efc66c2ae2ae564da25460af9f37

TBOX-131: logging format (#59)

* feat: include both logger name and filename in logging format
* docs: update logger.create docstring to recommend module logging best practice
* docs: update comments in example snippet to advise enabling package logging for custom modules

view details

push time in 4 days

PR merged Datatamer/tamr-toolbox

TBOX-131: logging format

↪️ Pull Request

This PR changes the logger format to include both the name of the logger and the name of the file where the logging is called. Previously only the name of the logger was shown. The logger name is valuable particularly for external packages with logging enabled, since this indicates the module where the logging call was made. For custom logging statements, however, the name of the logger is chosen by the script writer, so including the filename is important in this case.

Example: Run a script called script.py with the following contents:

import tamr_toolbox as tbox
from tamr_unify_client import Client

# logger created with name "my_logger"
log = tbox.utils.logger.create("my_logger", log_directory=".")

log.info("an info message")  # direct logging call

tbox.utils.logger.enable_package_logging("tamr_unify_client")
tamr = Client(...)
tamr.request("GET", "service/version")  # tamr_unify_client method that creates a log

The resulting logs will be:

INFO <4717485504> [2021-06-10 15:31:08,315] my_logger <script.py:10>  an info message
INFO <4717485504> [2021-06-10 15:31:08,418] tamr_unify_client.client <client.py:95>  GET http://10.20.0.139:9100/api/versioned/v1/service/version : 404

✔️ PR Todo

+17 -10

1 comment

2 changed files

skalish

pr closed time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 resource "aws_emr_cluster" "emr-cluster" {
     key_name                          = var.key_pair_name
   }
-
-  master_instance_group {
-    name          = var.master_instance_group_name
-    instance_type = var.master_instance_type
-    # NOTE: value must be 1 or 3
-    instance_count = var.master_group_instance_count
-    # Spot Instance definition
-    bid_price = var.master_bid_price
-    ebs_config {
-      size                 = var.master_ebs_size
-      type                 = var.master_ebs_type
-      volumes_per_instance = var.master_ebs_volumes_count
+  master_instance_fleet {
+    instance_type_configs {
+      instance_type = var.master_instance_type
+      ebs_config {
+        size                 = var.master_ebs_size
+        type                 = var.master_ebs_type
+        volumes_per_instance = var.master_ebs_volumes_count
+      }
     }
+    target_on_demand_capacity = 1
+    name          = var.master_instance_group_name
   }
-  core_instance_group {
-    name           = var.core_instance_group_name
-    instance_type  = var.core_instance_type
-    instance_count = var.core_group_instance_count
-    # Spot Instance definition
-    bid_price = var.core_bid_price
-    ebs_config {
-      size                 = var.core_ebs_size
-      type                 = var.core_ebs_type
-      volumes_per_instance = var.core_ebs_volumes_count
+
+  core_instance_fleet {
+    instance_type_configs {
+      bid_price_as_percentage_of_on_demand_price = 100
+      ebs_config {
+        size                 = var.core_ebs_size
+        type                 = var.core_ebs_type
+        volumes_per_instance = var.core_ebs_volumes_count
+      }
+      instance_type     = var.core_instance_type
+      weighted_capacity = 1
+    }
+    launch_specifications {
+      spot_specification {
+        allocation_strategy      = "capacity-optimized"
+        block_duration_minutes   = 0

block_duration_minutes is interesting because essentially it's a more deterministic version of spot instances, where you're guaranteed the instance will be around for x minutes, but you'll also be guaranteed to lose the instance at the end of that duration. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/emr_cluster#block_duration_minutes

I'm not sure it's optimal for HBase to lose all the spot instances at around the same time (the most likely outcome), so we might not want that yet.

Totally agree that timeout_duration_minutes should be a variable.
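
For example, a sketch of pulling timeout_duration_minutes out into a variable (the variable name, default, and timeout_action choice here are invented for illustration):

```hcl
variable "core_spot_timeout_duration_minutes" {
  type        = number
  default     = 60
  description = "How long to wait for spot capacity before the timeout action is taken"
}

# In the core_instance_fleet launch_specifications:
#   spot_specification {
#     allocation_strategy      = "capacity-optimized"
#     timeout_action           = "SWITCH_TO_ON_DEMAND"
#     timeout_duration_minutes = var.core_spot_timeout_duration_minutes
#   }
```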

msoni-tamr

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 resource "aws_emr_cluster" "emr-cluster" {
     key_name                          = var.key_pair_name
   }
-
-  master_instance_group {
-    name          = var.master_instance_group_name
-    instance_type = var.master_instance_type
-    # NOTE: value must be 1 or 3
-    instance_count = var.master_group_instance_count
-    # Spot Instance definition
-    bid_price = var.master_bid_price
-    ebs_config {
-      size                 = var.master_ebs_size
-      type                 = var.master_ebs_type
-      volumes_per_instance = var.master_ebs_volumes_count
+  master_instance_fleet {
+    instance_type_configs {
+      instance_type = var.master_instance_type
+      ebs_config {
+        size                 = var.master_ebs_size
+        type                 = var.master_ebs_type
+        volumes_per_instance = var.master_ebs_volumes_count
+      }
     }
+    target_on_demand_capacity = 1
+    name          = var.master_instance_group_name
   }
-  core_instance_group {
-    name           = var.core_instance_group_name
-    instance_type  = var.core_instance_type
-    instance_count = var.core_group_instance_count
-    # Spot Instance definition
-    bid_price = var.core_bid_price
-    ebs_config {
-      size                 = var.core_ebs_size
-      type                 = var.core_ebs_type
-      volumes_per_instance = var.core_ebs_volumes_count
+
+  core_instance_fleet {
+    instance_type_configs {
+      bid_price_as_percentage_of_on_demand_price = 100
+      ebs_config {
+        size                 = var.core_ebs_size
+        type                 = var.core_ebs_type
+        volumes_per_instance = var.core_ebs_volumes_count
+      }
+      instance_type     = var.core_instance_type
+      weighted_capacity = 1

Do you need this if you aren't making use of the aws_emr_instance_fleet resource?

msoni-tamr

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 resource "aws_emr_cluster" "emr-cluster" {
     key_name                          = var.key_pair_name
   }
-
-  master_instance_group {
-    name          = var.master_instance_group_name
-    instance_type = var.master_instance_type
-    # NOTE: value must be 1 or 3
-    instance_count = var.master_group_instance_count
-    # Spot Instance definition
-    bid_price = var.master_bid_price
-    ebs_config {
-      size                 = var.master_ebs_size
-      type                 = var.master_ebs_type
-      volumes_per_instance = var.master_ebs_volumes_count
+  master_instance_fleet {

Just a curiosity question - Does your testing involve only using spot instances for core nodes?

msoni-tamr

comment created time in 4 days

Pull request review comment Datatamer/terraform-aws-emr

Add support for instance fleets

 resource "aws_emr_cluster" "emr-cluster" {
     key_name                          = var.key_pair_name
   }
-
-  master_instance_group {
-    name          = var.master_instance_group_name
-    instance_type = var.master_instance_type
-    # NOTE: value must be 1 or 3
-    instance_count = var.master_group_instance_count
-    # Spot Instance definition
-    bid_price = var.master_bid_price
-    ebs_config {
-      size                 = var.master_ebs_size
-      type                 = var.master_ebs_type
-      volumes_per_instance = var.master_ebs_volumes_count
+  master_instance_fleet {
+    instance_type_configs {
+      instance_type = var.master_instance_type
+      ebs_config {
+        size                 = var.master_ebs_size
+        type                 = var.master_ebs_type
+        volumes_per_instance = var.master_ebs_volumes_count
+      }
     }
+    target_on_demand_capacity = 1

Would we still want to make this configurable with the master_group_instance_count variable?
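
If so, a sketch of reusing the existing variable (preserving the 1-or-3 note from the removed code via a validation block; the wording is invented for illustration):

```hcl
variable "master_group_instance_count" {
  type        = number
  default     = 1
  description = "Target on-demand capacity for the master instance fleet"
  validation {
    condition     = contains([1, 3], var.master_group_instance_count)
    error_message = "Master instance count must be 1 or 3."
  }
}

# Then in the master_instance_fleet block:
#   target_on_demand_capacity = var.master_group_instance_count
```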

msoni-tamr

comment created time in 4 days