Updated 9.30 BST March 30, 2020, with comment from Western Digital.
Numerous solid-state drives (SSDs) made by SanDisk suffer from a flaw that can see them wipe out everything stored on them at 40,000 hours of operation (roughly four and a half years). HPE has today joined Dell in naming SanDisk owner Western Digital as responsible for the bug, which has seen system administrators scramble to find and fix affected servers.
Failing to apply the firmware fix “will result in drive failure and data loss at 40,000 hours of operation and require restoration of data from backup if there is no fault tolerance, such as RAID 0 or even in a fault tolerance RAID mode if more SSDs fail than can be supported by the fault tolerance of the RAID mode on the logical drive,” HPE said.
It added: “After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.” (Many experienced teams will build stacks with non-sequential serial numbers and storage products from different vendors, but that’s not always easy…)
HPE guidance for customers posted on March 20 said that, based on its analysis of when servers equipped with the SanDisk SSDs started shipping, customers shouldn’t suffer issues before October 2020, giving end-users plenty of time to apply the critical patch before their drives get bricked. Other OEMs are likely to be affected.
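For administrators who want to sanity-check that timeline against their own estate, the arithmetic is simple enough to sketch. The Python snippet below is illustrative only and is not taken from HPE's or Dell's tooling: the function names and the `duty_cycle` parameter are our own assumptions, and the only figure drawn from the advisories is the 40,000-hour threshold. It assumes a drive has been powered on continuously since its in-service date unless told otherwise.

```python
from datetime import datetime, timedelta

FAILURE_HOURS = 40_000  # threshold cited in the HPE and Dell advisories


def projected_failure(in_service: datetime, duty_cycle: float = 1.0) -> datetime:
    """Estimate when a drive first powered on at `in_service` hits the threshold.

    `duty_cycle` is an assumed fraction of wall-clock time the drive spends
    powered on (1.0 means it has never been switched off).
    """
    if not 0 < duty_cycle <= 1:
        raise ValueError("duty_cycle must be in (0, 1]")
    return in_service + timedelta(hours=FAILURE_HOURS / duty_cycle)


def hours_remaining(power_on_hours: int) -> int:
    """Hours left before the threshold, given a drive's reported power-on hours."""
    return max(0, FAILURE_HOURS - power_on_hours)


if __name__ == "__main__":
    # 40,000 hours expressed as years of continuous operation
    print(round(FAILURE_HOURS / (24 * 365.25), 2))
    # Hypothetical drive that went into service on January 1, 2016
    print(projected_failure(datetime(2016, 1, 1)))
```

A drive's own power-on-hours counter is a better guide than a purchase date, so `hours_remaining` is likely the more useful of the two where that figure is available.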
(Computer Business Review has not yet seen any further customer advisories. If you got one from another server vendor, get in touch with our editor…)
Hey, Western Digital: Thanks for That
Today naming Western Digital for the first time (an earlier statement had just cited a “Solid State Drive manufacturer”), HPE told Computer Business Review in an emailed statement: “HPE was notified by Western Digital of a manufacturer firmware defect in certain SAS SSD models used across the industry.
“Because this defect only causes drive failure after 40,000 hours of operation, no HPE customers are in danger of failing for several months. HPE has received Serial Number information on the drives delivered to HPE customers, and we are actively reaching out to those customers and to provide updated firmware.”
A Western Digital spokesperson told Computer Business Review: “Delivering high-quality, highly reliable storage solutions is our number one priority.
“Western Digital discovered a firmware issue in a specific line of older, end-of-life SanDisk SAS SSDs, and pre-emptively contacted and began collaborating with our OEM partners to quickly provide a solution for their customers. A firmware fix for this issue is available for use by customers. As part of our policy, we cannot comment any further. Any questions about OEM products should be directed to the OEM. Thank you.”
SanDisk SSD Bug: Dell Told Customers in February
Dell meanwhile notified its customers in February, emailing them to say that it had “identified a potentially critical issue where certain solid state drives may experience failure and potential data loss due to an issue with the drives’ firmware, the drives may fail after approximately 40,000 hours of usage.”
SanDisk drives ranging from 200GB to 1.6TB are understood to be affected. These can be found in a sprawling array of Dell and HPE servers: both companies have furnished users with a full list of impacted products.
HPE has made Linux, VMware, and Windows scripts available that check SSD firmware for the 40,000 power-on-hours failure issue. Dell has done the same, and pointed the finger at SanDisk model numbers LT0200MO, LT0400MO, LT0800MO, LT1600MO, LT0200WM, LT0400WM, LT0800WM, LT0800RO and LT1600RO.
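The vendor scripts are the right tool for the job, but a first-pass triage of an existing asset inventory can be sketched in a few lines of Python. The example below is a rough illustration rather than a reproduction of either vendor's check: the inventory format (a list of records with `host`, `model` and `power_on_hours` fields) and the function names are hypothetical, and it matches on model number only. It does not inspect firmware versions, so a drive that has already been patched would still be flagged.

```python
from typing import Dict, List, Union

Drive = Dict[str, Union[str, int]]

# Model numbers as listed above (attributed to Dell's advisory).
AFFECTED_MODELS = frozenset({
    "LT0200MO", "LT0400MO", "LT0800MO", "LT1600MO",
    "LT0200WM", "LT0400WM", "LT0800WM",
    "LT0800RO", "LT1600RO",
})


def is_affected(model: str) -> bool:
    """True if the model string exactly matches a listed model number.

    Assumes the inventory stores the bare model number; real tools may
    report it with a vendor prefix or suffix that needs stripping first.
    """
    return model.strip().upper() in AFFECTED_MODELS


def triage(inventory: List[Drive]) -> List[Drive]:
    """Return affected drives, those with the most power-on hours first."""
    flagged = [d for d in inventory if is_affected(str(d["model"]))]
    return sorted(flagged, key=lambda d: int(d["power_on_hours"]), reverse=True)


if __name__ == "__main__":
    # Hypothetical inventory records, for illustration only
    inventory = [
        {"host": "db-01", "model": "LT0800MO", "power_on_hours": 31_200},
        {"host": "web-03", "model": "EXAMPLE-SSD", "power_on_hours": 39_000},
        {"host": "db-02", "model": "lt1600ro ", "power_on_hours": 38_450},
    ]
    for drive in triage(inventory):
        print(drive["host"], drive["model"], drive["power_on_hours"])
```

Sorting by power-on hours puts the drives closest to the threshold at the top of the patch queue.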
Attentive systems administrators should have little trouble identifying the affected servers and patching them in a hopefully bug-free manner, but the issue is frustrating for major OEMs like Dell and HPE, which face having to identify and notify all impacted customers, and which will no doubt take the brunt of any criticism.