If you are wondering where the data of this site comes from, please visit https://api.github.com/users/TanjaBayer/events. GitMemory does not store any data, but only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.

TanjaBayer/aioice 0

asyncio-based Interactive Connectivity Establishment (RFC 5245)

TanjaBayer/ID-Card-Segmentation 0

Segmentation of ID Cards using Semantic Segmentation

TanjaBayer/insightface 0

State-of-the-art 2D and 3D Face Analysis Project

TanjaBayer/mrz 0

Machine Readable Zone generator and checker for official travel documents sizes 1, 2, 3, MRVA and MRVB (Passports, Visas, national id cards and other travel documents)

TanjaBayer/nx-plugins 0

Nx plugins built by FlowAccount team, helps deploy systems to the cloud

TanjaBayer/PKGBUILD 0

A list of PKGBUILDs I maintain or contribute to for Arch Linux (AUR).

TanjaBayer/pycodestyle 0

Simple Python style checker in one Python file

TanjaBayer/pyseq 0

Python implementation of Needleman-Wunsch and Hirschberg algorithm.

issue commentray-project/ray

[runtime_env] [serve] Updating deployment that uses working_dir causes backend_state.update() to fail

@edoakes you can remove the first deploy; it only seems to be related to num_replicas. This also fails:

import ray
from ray import serve

ray.init(runtime_env={"working_dir": "subdir"})
serve.start()

@serve.deployment
class MyDeployment:
    def __init__(self):
        pass

MyDeployment.options(num_replicas=2).deploy()
edoakes

comment created time in 21 days

issue commentray-project/ray

[serve] Cannot create actor from within a serve deployment when using working_dir

Hm, I can check whether there is some error in the constructor of our code that is maybe silent. But the question would be why it worked before and is now raising the error.

edoakes

comment created time in 21 days

issue commentray-project/ray

[serve] Cannot create actor from within a serve deployment when using working_dir

@edoakes

I think you do not even need the actor; I can get the same error with this. It seems the open() call alone is already enough:

import ray
from ray import serve
import time
ray.init(runtime_env={"working_dir": "subdir"})
serve.start()

@serve.deployment
class MyDeployment:
    def __init__(self):

        with open("test", "w") as f:
            self._f = f.read()


    def __call__(self, *args):
        return True
MyDeployment.deploy()
handle = MyDeployment.get_handle()
print(ray.get(handle.remote()))
edoakes

comment created time in 21 days

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

I was able to reproduce it on a local machine with this script, which is completely independent of our code but does something similar:

import asyncio
import os

import ray
from ray import serve

ray.init(os.environ.get('RAY_CLIENT_VIP', 'ray://localhost:10001'),
         runtime_env={'working_dir': '.', 'excludes': ['.git/']},
         namespace='random')
client = serve.start(detached=True,
                     http_options={'host': '0.0.0.0', 'port': '8000'})


@serve.deployment
class TestBackend:
    """ Actor for proving functionality form static_holo_verification library

    """

    def __init__(self):
        pass

    async def __call__(self) -> None:
        """ Just wait"""
        import torch
        torch.cuda.is_available()
        await asyncio.sleep(10)

runtime_env_gpu = {
    "conda": {
        "channels": ["conda-forge", "defaults", "anaconda"],
        "dependencies": [
            "pip",
            {"pip": ["tensorflow-gpu==1.14", "numpy==1.21.1", "onnxruntime-gpu", "mxnet", "starlette", "ray[serve]",
                     "pycaret", "opencv-python-headless", "pandas", "Pillow", "SciPy", "matplotlib",
                     '-f https://download.pytorch.org/whl/torch_stable.html',
                     'torch==1.8.1+cu111',
                     '-f https://download.pytorch.org/whl/torch_stable.html',
                     'torchvision==0.9.1+cu111']}],
    },
    "env_vars": {
        "OMP_NUM_THREADS": "1",
        "TF_WARNINGS": "none",
    },
}
backends = [(TestBackend, {
    'num_replicas': 1,
    "max_concurrent_queries": 1,
    'ray_actor_options': {'num_cpus': 1, 'num_gpus': 0.1, 'runtime_env': runtime_env_gpu},
})]

for func, parameters in backends:
    func.options(**parameters).deploy()

Starting up a ray cluster with ray start --head and then just executing the script.

Using: ray, version 2.0.0.dev0
'_ray_commit': '0e968c1e826ddd1032d06c96aeba0420c1269347'

It failed after approximately 6 minutes.

edoakes

comment created time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

The OOM is no longer happening, so that doesn't seem to be related.

Yes

Now that you've changed the working_dir, you're no longer seeing the errors about No such file or directory

Yes

I hope I am able to attach the logs tomorrow

edoakes

comment created time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

Btw, @TanjaBayer as an experiment what happens if you don't use runtime_env at all? Do you see the same issues?

@edoakes I did not yet have a chance to test that. Do you have any example script I could try? Preferably it should be a bit more complicated than just starting one serve backend :smile:

My scripts will not work because they will instantly fail if they do not have the appropriate requirements installed in the runtime...

edoakes

comment created time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

@TanjaBayer Would you mind sharing what working_dir is set to in the runtime_env

You are right, I definitely don't want anaconda3/lib in there. Just yesterday I changed the WORKDIR in my Dockerfile (which is used for deploying the Kubernetes job) to point to /home/ray, to be able to write a file with open('filename', 'r'). My working_dir is ./, and since anaconda3 in the rayproject/ray:1.6.0-py37 image is in this location, it resulted in the above error. I have now changed the WORKDIR so it no longer points to /home/ray, and I am not seeing the above error anymore.

However, the crashing dashboard is still a problem :cry:, so I have no idea whether it can be related in any way...

edoakes

comment created time in a month

PR opened deepinsight/insightface

Provide possibility to add onnxruntime provider

If not defined otherwise, FaceAnalysis blocks the whole GPU memory, although the models do not require it.

Onnxruntime provides the possibility to limit the memory used by a process by passing the providers and provider_options parameters:

https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html

However, according to the docstrings of onnxruntime-gpu 1.8.1, this is not exactly the correct way to pass them:

class InferenceSession(Session):
    """
    This is the main class used to run a model.
    """
    def __init__(self, path_or_bytes, sess_options=None, providers=None, provider_options=None, **kwargs):
        """
        :param path_or_bytes: filename or serialized ONNX or ORT format model in a byte string
        :param sess_options: session options
        :param providers: Optional sequence of providers in order of decreasing
            precedence. Values can either be provider names or tuples of
            (provider name, options dict). If not provided, then all available
            providers are used with the default precedence.
        :param provider_options: Optional sequence of options dicts corresponding
            to the providers listed in 'providers'.

Giving developers the ability to pass their own values for these parameters would, in my opinion, improve the way the package can be used.

An example call would look like:

self.model = FaceAnalysis(
    root='path/to/models/',
    providers=["CUDAExecutionProvider"],
    provider_options=[{"device_id": 0,
                       "gpu_mem_limit": gpu_mem_limit * 1024 * 1024,
                       'cudnn_conv_algo_search': 'HEURISTIC'}],
)

This is a really easy implementation, as it assigns the same amount of memory to each of the models; providing different options for each model would also be possible, but would require some more changes.
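
For context, here is a minimal sketch of how such keyword arguments could be forwarded to onnxruntime internally. The wrapper class and attribute names are hypothetical illustrations, not the actual insightface code or the PR diff:

# Illustrative sketch only -- assumes the package's model wrappers create an
# onnxruntime.InferenceSession and simply forward the new keyword arguments.
import onnxruntime


class ModelWrapper:  # hypothetical name, for illustration
    def __init__(self, model_file, providers=None, provider_options=None):
        # Forwarding providers/provider_options lets the caller cap GPU memory
        # (e.g. via "gpu_mem_limit") instead of letting onnxruntime take it all.
        self.session = onnxruntime.InferenceSession(
            model_file,
            providers=providers or ["CUDAExecutionProvider", "CPUExecutionProvider"],
            provider_options=provider_options,
        )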

+14 -13

0 comment

2 changed files

pr created time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha 6538da9787910f0fab34ec406a917cf8f5f9045a

Provide possibility to add provider

view details

push time in a month

push eventTanjaBayer/insightface

Jia Guo

commit sha cdc3d4ed5de14712378f3d5a14249661e54a03ec

package 0.4.1

view details

push time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

I am not sure if those issues are really related or not; however, I now see this error being thrown when trying to start up:

Issue with path: /home/ray/anaconda3/lib/libstdc++.so.6.0.21
Got Error from logger channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception iterating responses: Server startup failed."
	debug_error_string = "{"created":"@1629723876.209273458","description":"Error received from peer ipv4:10.233.61.116:10001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Exception iterating responses: Server startup failed.","grpc_status":2}"
>
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/logsclient.py", line 42, in _log_main
    for record in log_stream:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception iterating responses: Server startup failed."
	debug_error_string = "{"created":"@1629723876.209273458","description":"Error received from peer ipv4:10.233.61.116:10001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Exception iterating responses: Server startup failed.","grpc_status":2}"
>
Traceback (most recent call last):
  File "serve_backend_api.py", line 69, in <module>
    ray.init(RAY_CLIENT_VIP, runtime_env=runtime_env_api, namespace='abc')
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 773, in init
    return builder.connect()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/client_builder.py", line 101, in connect
    self.address, job_config=self._job_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client_connect.py", line 35, in connect
    ignore_version=ignore_version)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/__init__.py", line 83, in connect
    self.client_worker._server_init(job_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/worker.py", line 529, in _server_init
    runtime_env.rewrite_runtime_env_uris(job_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 538, in rewrite_runtime_env_uris
    pkg_name = get_project_package_name(working_dir, py_modules, excludes)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 401, in get_project_package_name
    _get_excludes(working_dir, excludes)))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 315, in _hash_modules
    _dir_travel(root, excludes, handler)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 264, in _dir_travel
    _dir_travel(sub_path, excludes, handler)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 264, in _dir_travel
    _dir_travel(sub_path, excludes, handler)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 264, in _dir_travel
    _dir_travel(sub_path, excludes, handler)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 261, in _dir_travel
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 258, in _dir_travel
    handler(path)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/runtime_env.py", line 306, in handler
    with path.open("rb") as f:
  File "/home/ray/anaconda3/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/ray/anaconda3/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/anaconda3/lib/libstdc++.so.6.0.21'

It seems like the lib is not linked correctly, which causes the error, and this issue does not seem to be new, as it is present in several images which I have on my local machine without pulling them (e.g. rayproject/ray:1.5.0-py37-gpu): [screenshot]

edoakes

comment created time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

I think the OOM error was due to the constant restarting of the process and not stopping the failing one. However, the OOM only happened on the one cluster; the logs from the other cluster (see zip files above) do not include any OOM errors. So I do not think that the OOM is the cause of the dashboard crash.


The logs of the serve controller should be part of the worker logs, right? Or is there any other option to get logs from the serve controller?

edoakes

comment created time in a month

push eventTanjaBayer/insightface

Jia Guo

commit sha 6bc2be44545c0954d4c2599d89ecf03624a6c6b6

Merge pull request #1720 from TanjaBayer/master Mxnet should not be mandatory for the python package

view details

push time in a month

PR opened deepinsight/insightface

Mxnet should not be mandatory for the python package

Right now the structure of the imports requires mxnet to be installed. However, mxnet is not the default backend for running inference, so it should not need to be installed when using the package for inference.

With the changed import structure, mxnet no longer needs to be installed when running FaceAnalysis.
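
A minimal sketch of the general pattern, assuming the mxnet-dependent code path imports it lazily (the function name and error message are illustrative, not the actual diff):

# Illustrative sketch only: defer the mxnet import to the code path that needs
# it, so plain onnxruntime-based inference works without mxnet installed.
def _load_mxnet():
    try:
        import mxnet
    except ImportError as exc:
        raise ImportError(
            "mxnet is required for this code path; install it to use the "
            "mxnet backend."
        ) from exc
    return mxnet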

+5 -5

0 comment

2 changed files

pr created time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha 1dd55df0b19b4b4c3b291cf7780f3b4c299c6979

Fix import in other module

view details

push time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha 25f9d89bf793c1dee996ec235b4a09dbc43b3538

Revert "Fix import in other file" This reverts commit ab56f1ced65c7fd3183bbbd8a0f507fe9982395b.

view details

push time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha ab56f1ced65c7fd3183bbbd8a0f507fe9982395b

Fix import in other file

view details

push time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha 2ee118cac21739c7dc315102085ac8c05fd18918

Change import strucutre so that mxnet is not mandatory

view details

push time in a month

push eventTanjaBayer/insightface

Jia Guo

commit sha 81a27d41bd07796d1193027daa9d4e5ee23cdd4f

Merge pull request #1712 from TanjaBayer/master Catch Module Not Found Error for Pypandoc

view details

Vinh Quang Tran

commit sha aee09e3ccf880f2da18458a04e4687fc0a2d6c59

Remove unused code

view details

Jia Guo

commit sha 4b9ea7bce9a581c079d72dac17f73b85b78822b6

Merge pull request #1714 from vinhtq115/master Remove unused code in verification.py

view details

push time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

I need to correct myself: the errors are there on the second cluster again, it just took a bit longer than before (maybe fewer failures). The whole content of the logs folder: [screenshot]. Here are the logs from the head node; however, I did not include the logs for the runtime environments as they contain sensitive data: logs.zip

edoakes

comment created time in a month

issue commentray-project/ray

Dashboard crashing due to GCS RPC errors

We currently have two clusters with a similar setup. One cluster I completely restarted this morning to get new logs, which means it also pulled the latest docker image for the tag; on this cluster it is not failing anymore right now. But I also made some changes to our code (no failing serve backends anymore). Could that mean that this dashboard crashing issue is also caused by this: https://github.com/ray-project/ray/issues/17823?

However on the second cluster I still have some logs which I can provide here:

Info about the System / Setup

  • Cluster launched via helm operator
  • 1 head node (4CPU, 30 GB RAM, rayResources: 0CPU) - 2 worker nodes (12CPU, 1GPU, 20GB RAM)
  • Docker Image: rayproject/ray-ml:1.6.0-py37-gpu
  • The commit of the failing version: b51c17576bfd6f8d0ab66c2a0e2f5914f9fb2921

GCS Logs

File sizes in MB:

330     ./gcs_server.out
0       ./gcs_server.err
512     ./python-core-worker-697c74c78c7da879497a7f73c31e02a5b5bb2d4720ed1977f5545f5e_2712.1.log
30      ./worker-697c74c78c7da879497a7f73c31e02a5b5bb2d4720ed1977f5545f5e-01000000-2712.err
47      ./python-core-worker-697c74c78c7da879497a7f73c31e02a5b5bb2d4720ed1977f5545f5e_2712.log

gcs_server.out

The log is crowded with entries like this:

[2021-08-20 00:32:55,534 I 157 157] gcs_actor_manager.cc:817: Actor is failed 9a32a838f33a1aa39294a70601000000 on worker 9dde9e8f0e87d8ad46da4ccb48fdb98b65a6745b43b23e9e71bbe1f3 at node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50, need_reschedule = 0, remaining_restarts = 0, job id = 01000000
[2021-08-20 00:32:55,535 I 157 157] gcs_actor_manager.cc:156: Finished creating actor, job id = 01000000, actor id = 9a32a838f33a1aa39294a70601000000
[2021-08-20 00:32:55,535 I 157 157] gcs_actor_manager.cc:552: Destroying actor, actor id = 9a32a838f33a1aa39294a70601000000, job id = 01000000
[2021-08-20 00:32:55,656 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#zCeKCr' was not found.
[2021-08-20 00:32:55,764 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#zCeKCr' was not found.
[2021-08-20 00:32:55,765 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#zCeKCr' was not found.
[2021-08-20 00:32:55,766 I 157 157] gcs_placement_group_scheduler.cc:533: Cancelling all committed bundles of a placement group, id is fd714a6ea2614d89001456708ca776e4
[2021-08-20 00:32:55,766 I 157 157] gcs_placement_group_manager.cc:330: Placement group of an id, fd714a6ea2614d89001456708ca776e4 is removed successfully.
[2021-08-20 00:32:55,928 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#qZFakO' was not found.
[2021-08-20 00:32:55,929 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#qZFakO' was not found.
[2021-08-20 00:32:55,930 I 157 157] gcs_placement_group_manager.cc:249: Successfully created placement group SERVE_CONTROLLER_ACTOR:***Actor#qZFakO_placement_group, id: 38f226e18e232c86f23efa922e6ed897
[2021-08-20 00:32:55,930 I 157 157] gcs_resource_manager.cc:60: Updating resources, node id = 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50
[2021-08-20 00:32:55,932 I 157 157] gcs_actor_manager.cc:128: Registering actor, job id = 01000000, actor id = 0bc4ac5663903402bd6255a101000000
[2021-08-20 00:32:55,932 W 157 157] task_spec.cc:48: More than 43581 types of tasks seen, this may reduce performance.
[2021-08-20 00:32:55,932 I 157 157] gcs_actor_manager.cc:133: Registered actor, job id = 01000000, actor id = 0bc4ac5663903402bd6255a101000000
[2021-08-20 00:32:55,933 I 157 157] gcs_actor_manager.cc:152: Creating actor, job id = 01000000, actor id = 0bc4ac5663903402bd6255a101000000
[2021-08-20 00:32:55,933 I 157 157] gcs_actor_scheduler.cc:214: Start leasing worker from node 236d47b662acc9229f2bb5a1fcbdce1497132b1fe66563df7bc98a8a for actor 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:56,037 I 157 157] gcs_actor_scheduler.cc:474: Finished leasing worker from 236d47b662acc9229f2bb5a1fcbdce1497132b1fe66563df7bc98a8a for actor 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:56,037 I 157 157] gcs_actor_scheduler.cc:214: Start leasing worker from node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50 for actor 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:57,099 I 157 157] gcs_actor_scheduler.cc:474: Finished leasing worker from 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50 for actor 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:57,099 I 157 157] gcs_actor_scheduler.cc:318: Start creating actor 0bc4ac5663903402bd6255a101000000 on worker a7fd46032aad983a3f97a016823880b11b46f61a62d7d97a13e981b3 at node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50, job id = 01000000
[2021-08-20 00:32:57,268 I 157 157] gcs_actor_scheduler.cc:356: Succeeded in creating actor 0bc4ac5663903402bd6255a101000000 on worker a7fd46032aad983a3f97a016823880b11b46f61a62d7d97a13e981b3 at node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50, job id = 01000000
[2021-08-20 00:32:57,268 I 157 157] gcs_actor_manager.cc:894: Actor created successfully, actor id = 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:57,269 W 157 157] gcs_worker_manager.cc:37: Reporting worker failure, worker id = a7fd46032aad983a3f97a016823880b11b46f61a62d7d97a13e981b3, node id = 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50, address = 10.233.88.122, exit_type = CREATION_TASK_ERROR, has creation task exception = 1. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.
[2021-08-20 00:32:57,269 I 157 157] gcs_actor_manager.cc:681: Worker a7fd46032aad983a3f97a016823880b11b46f61a62d7d97a13e981b3 on node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50 exited, type=CREATION_TASK_ERROR, has creation_task_exception = 1
[2021-08-20 00:32:57,269 I 157 157] gcs_actor_manager.cc:686: Formatted creation task exception: Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 493, in ray._raylet.execute_task

  File "/tmp/ray/session_2021-08-18_09-19-22_924133_143/runtime_resources/conda/ray-7017d8b6299f500a0f00a1a7e1953cf87d30ffd1/lib/python3.7/site-packages/ray/_private/memory_monitor.py", line 152, in raise_if_low_memory
    self.error_threshold))

ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-operator-ray-worker-type-n5k9s is used (23.24 / 24.0 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
152     0.18GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=
7648    0.13GiB ray::***Actor
8768    0.13GiB ray::***Actor
8068    0.13GiB ray::***Actor
7228    0.13GiB ray::***Actor
9048    0.13GiB ray::***Actor
7683    0.13GiB ray::***Actor
8943    0.13GiB ray::***Actor
7123    0.13GiB ray::***Actor
8698    0.13GiB ray::***Actor

In addition, up to 0.21 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---



During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler

  File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task

  File "python/ray/_raylet.pyx", line 604, in ray._raylet.execute_task

ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::SERVE_CONTROLLER_ACTOR:***Actor#qZFakO:RayServeWrappedReplica.__init__ (pid=7265, ip=10.233.88.122)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-operator-ray-worker-type-n5k9s is used (23.24 / 24.0 GB). The top 10 memory consumers are:

PID     MEM     COMMAND
152     0.18GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=
7648    0.13GiB ray::***Actor
8768    0.13GiB ray::***Actor
8068    0.13GiB ray::***Actor
7228    0.13GiB ray::***Actor
9048    0.13GiB ray::***Actor
7683    0.13GiB ray::***Actor
8943    0.13GiB ray::***Actor
7123    0.13GiB ray::***Actor
8698    0.13GiB ray::***Actor

In addition, up to 0.21 GiB of shared memory is currently being used by the Ray object store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---

[2021-08-20 00:32:57,269 I 157 157] gcs_actor_manager.cc:817: Actor is failed 0bc4ac5663903402bd6255a101000000 on worker a7fd46032aad983a3f97a016823880b11b46f61a62d7d97a13e981b3 at node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50, need_reschedule = 0, remaining_restarts = 0, job id = 01000000
[2021-08-20 00:32:57,269 I 157 157] gcs_actor_manager.cc:156: Finished creating actor, job id = 01000000, actor id = 0bc4ac5663903402bd6255a101000000
[2021-08-20 00:32:57,269 I 157 157] gcs_actor_manager.cc:552: Destroying actor, actor id = 0bc4ac5663903402bd6255a101000000, job id = 01000000
[2021-08-20 00:32:57,359 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#qZFakO' was not found.
[2021-08-20 00:32:57,466 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#qZFakO' was not found.
[2021-08-20 00:32:57,466 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#qZFakO' was not found.
[2021-08-20 00:32:57,467 I 157 157] gcs_placement_group_scheduler.cc:533: Cancelling all committed bundles of a placement group, id is 38f226e18e232c86f23efa922e6ed897
[2021-08-20 00:32:57,467 I 157 157] gcs_placement_group_manager.cc:330: Placement group of an id, 38f226e18e232c86f23efa922e6ed897 is removed successfully.
[2021-08-20 00:32:57,620 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:F***Actor#jKIwYZ' was not found.
[2021-08-20 00:32:57,620 W 157 157] gcs_actor_manager.cc:244: Actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#jKIwYZ' was not found.
[2021-08-20 00:32:57,621 I 157 157] gcs_resource_manager.cc:60: Updating resources, node id = 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50
[2021-08-20 00:32:57,621 I 157 157] gcs_placement_group_manager.cc:249: Successfully created placement group SERVE_CONTROLLER_ACTOR:***Actor#jKIwYZ_placement_group, id: 1c4cf618185a124ecdbfed13439ea111
[2021-08-20 00:32:57,622 I 157 157] gcs_actor_manager.cc:128: Registering actor, job id = 01000000, actor id = 9bd72013394a902d7796502201000000
[2021-08-20 00:32:57,622 W 157 157] task_spec.cc:48: More than 43582 types of tasks seen, this may reduce performance.
[2021-08-20 00:32:57,623 I 157 157] gcs_actor_manager.cc:133: Registered actor, job id = 01000000, actor id = 9bd72013394a902d7796502201000000
[2021-08-20 00:32:57,623 I 157 157] gcs_actor_manager.cc:152: Creating actor, job id = 01000000, actor id = 9bd72013394a902d7796502201000000
[2021-08-20 00:32:57,623 I 157 157] gcs_actor_scheduler.cc:214: Start leasing worker from node 236d47b662acc9229f2bb5a1fcbdce1497132b1fe66563df7bc98a8a for actor 9bd72013394a902d7796502201000000, job id = 01000000
[2021-08-20 00:32:57,750 I 157 157] gcs_actor_scheduler.cc:474: Finished leasing worker from 236d47b662acc9229f2bb5a1fcbdce1497132b1fe66563df7bc98a8a for actor 9bd72013394a902d7796502201000000, job id = 01000000
[2021-08-20 00:32:57,750 I 157 157] gcs_actor_scheduler.cc:214: Start leasing worker from node 4e6c24939ccf59929941f478efbb888a87eac37768bfb5deda45ae50 for actor 9bd72013394a902d7796502201000000, job id = 01000000

worker-697c74c78c7da879497a7f73c31e02a5b5bb2d4720ed1977f5545f5e-01000000-2712.err

Crowded with entries like this:

2021-08-20 00:30:10,990 ERROR controller.py:121 -- Exception updating backend state: Failed to look up actor with name 'SERVE_CONTROLLER_ACTOR:***Actor#xJPBqd'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. The actor hasn't been created because named actor creation is asynchronous. 4. You did not use a namespace matching the namespace of the actor.
2021-08-20 00:30:11,203 WARNING backend_state.py:961 -- Replica ***Actor#xJPBqd of backend ***Actor failed health check, stopping it.
2021-08-20 00:30:11,336 INFO backend_state.py:869 -- Adding 1 replicas to backend '***Actor'.

Note: the stars replace the actual name of the actor.

python-core-worker-697c74c78c7da879497a7f73c31e02a5b5bb2d4720ed1977f5545f5e_2712.log

This file also contains the memory errors, plus these logs:

[2021-08-20 00:37:17,178 I 2712 2740] direct_actor_transport.cc:209: Failing pending tasks for actor d46e8c777e2b8dc193801df901000000 because the actor is already dead.
[2021-08-20 00:37:17,178 I 2712 2740] direct_actor_transport.cc:228: Failing tasks waiting for death info, size=0, actor_id=d46e8c777e2b8dc193801df901000000
[2021-08-20 00:37:17,178 E 2712 2740] task_manager.cc:332: Task failed: IOError: 14: failed to connect to all addresses: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve.backend_worker, class_name=create_backend_replica.<locals>.RayServeWrappedReplica, function_name=reconfigure, function_hash=}, task_id=44d7caa722182bf8d46e8c777e2b8dc193801df901000000, task_name=RayServeWrappedReplica.reconfigure(), job_id=01000000, num_args=2, num_returns=2, actor_task_spec={actor_id=d46e8c777e2b8dc193801df901000000, actor_caller_id=ffffffffffffffff3b9f6d9ab9876eee2dfa3ace01000000, actor_counter=0}
[2021-08-20 00:37:17,263 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 0dc974cfe6fe3fa034f63c5c4e8d1d6d02845f640100000001000000.
[2021-08-20 00:37:17,264 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 3d72feee038d88643bfe0671e3af71a55b1584b50100000001000000.
[2021-08-20 00:37:17,264 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 3fc5765f28f7f7c5359085ad861107b34e3bb8b80100000001000000.
[2021-08-20 00:37:17,265 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: b5c6a89a26b50d1f12dce30f3183420ba45ff5400100000001000000.
[2021-08-20 00:37:17,265 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 4e7aee99935d0b8a58d28a93005346fe11734e6e0100000001000000.
[2021-08-20 00:37:17,265 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 03aed5d430a72d3ca06b2e773754e547b986f47e0100000001000000.
[2021-08-20 00:37:17,265 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 368ad177071667779826334a8da4e12487c34b610100000001000000.
[2021-08-20 00:37:17,265 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 610e5d40d358be6c538b25390efd0ad7fbca84370100000001000000.
[2021-08-20 00:37:17,266 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 0c71e9759ead297b2d01c5de8a5a7051b7b0f8c30100000001000000.
[2021-08-20 00:37:17,266 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 1039812dfc46e3f554c0c7269cce228ca50168010100000001000000.
[2021-08-20 00:37:17,266 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: b630dc4954aaa59c9c6ffb8dfdc88875ac1753a30100000001000000.
[2021-08-20 00:37:17,266 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 8ddaa4bb57bac286eedf2a8930d299e50bdb6c150100000001000000.
[2021-08-20 00:37:17,266 W 2712 2771] plasma_store_provider.cc:137: Trying to put an object that already existed in plasma: 558c29d6471c8fdea89067c7c363b50c5db2a9120100000001000000.
edoakes

comment created time in a month

PR opened deepinsight/insightface

Catch Module Not Found Error for Pypandoc

Installing pypandoc should not be required for installing the Python package; it is only needed for the readme. However, when it is not installed it might throw a ModuleNotFoundError, which should also be caught.
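
A minimal sketch of the pattern, assuming a setup.py that converts the readme with pypandoc (file names and the fallback are illustrative, not the actual diff):

# Illustrative sketch only: make the pypandoc import optional in setup.py so
# installing the package works without pypandoc/pandoc being available.
try:
    import pypandoc
    long_description = pypandoc.convert_file("README.md", "rst")
except (ImportError, ModuleNotFoundError, OSError):
    # ModuleNotFoundError is a subclass of ImportError, so it is covered either
    # way; OSError covers the case where the pandoc binary itself is missing.
    with open("README.md", encoding="utf-8") as f:
        long_description = f.read()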

+1 -1

0 comment

1 changed file

pr created time in a month

push eventTanjaBayer/insightface

Tanja Bayer

commit sha 25d418b99e6a77dbaaf9c0dfb65e137042d11cd8

Catch Module Not Found Error for Pypandoc

view details

push time in a month

fork TanjaBayer/insightface

State-of-the-art 2D and 3D Face Analysis Project

https://insightface.ai

fork in a month

PR opened flowaccount/nx-plugins

Use lowercase to ignore also accept upper case YARN and NPM

Right now, using yarn as the package manager does not work due to a wrong comparison of the name string.

This results in package versions being completely ignored and a fallback to the latest version. This is especially bad when the latest version of a package cannot be used, and results in a complete failure of the lambda functions (e.g. the elasticsearch package).

+2 -2

0 comment

1 changed file

pr created time in a month

push eventTanjaBayer/nx-plugins

Tanja Bayer

commit sha bcd1c37fa587170765e6dec93f040ee2bb939fba

Use strict equality

view details

push time in a month

push eventTanjaBayer/nx-plugins

Tanja Bayer

commit sha 16d6e28e5ec4a4756fa1eb2ec4c380aecac72a02

Change comparison operator

view details

push time in a month

push eventTanjaBayer/nx-plugins

Tanja Bayer

commit sha 22baf18ef70a2a3d71ab61263e8825f876e02393

Also ignore case for npm

view details

push time in a month

PR opened pascalbayer/nx-plugins

Fix check for packager string

Packager name is YARN, use toLowerCase to catch all cases

+1 -1

0 comment

1 changed file

pr created time in a month