FAQ

Frequently asked questions

Q: The trainer stopped and hasn’t progressed in several minutes.

A: Usually an issue with the GPUs communicating with each other. See the NCCL doc

Q: Exitcode -9

A: This usually happens when you run out of system RAM.

Q: Exitcode -7 while using deepspeed

A: Try upgrading deepspeed w: pip install -U deepspeed

Q: AttributeError: ‘DummyOptim’ object has no attribute ‘step’

A: You may be using deepspeed with single gpu. Please don’t set deepspeed: in yaml or cli.