Skip to content
This repository was archived by the owner on Jan 7, 2023. It is now read-only.
This repository was archived by the owner on Jan 7, 2023. It is now read-only.

run intel caffe using multi-node with mlsl on AMD cpus ,stopped at Iteration 0 #19

@Tron-x

Description

@Tron-x

when i run intel caffe on multi-node(four node) with mlsl on AMD cpus,something is wrong ,the training stopped at the Iteration 0, when run on single node ,it is ok.
image
when i htop on evry node
image
my run instruct is :./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt

I think something is wrong with mlsl ,my mlsl version is
image
because when i run with my own openmpi,it is ok

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions