Linux and SSDrives: shit might happen?
Invasor
Moderator



Posts: 7638
Location: On the road
Posted: Wed, 17th Jun 2015 01:05    Post subject: Linux and SSDrives: shit might happen?
Quote:
...the indexing process crashed. Since the indexing process is guarded by supervise, crashing in a loop would have been understandable but a complete crash was not. As it turned out the filesystem was in a read-only mode. All right, let’s assume it was a cosmic ray :) The filesystem got fixed, files were restored from another healthy server and everything looked fine again.

The next day another server ended with filesystem in read-only, two hours after another one and then next hour another one. Something was going on.
...
Not a single day without corruptions
While more and more machines were dying, we had managed to automate the restore procedure to a level we were comfortable with. At every failure, we tried to look at different patterns of the corruption in hopes that we would find the smallest common denominator. They all had the same characteristics. But one thing started to be more and more clear – we saw the issue only on a portion of our servers. The software stack was identical but the hardware was slightly different. Mainly the SSDs were different but they were all from the same manufacturer.
...
As it turned out, the lost data was always 512 bytes, which is one block on the drive. One step further, the block ends up full of zeroes. A hardware bug? Or is the block zeroed? What can zero the block? TRIM! TRIM instructs the SSD drive to zero the empty blocks. But these blocks were not empty and other types of SSDs were not impacted. We gave it a try and disabled TRIM across all of our servers. It would explain everything!
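
A rough sketch of how one could look for that symptom, i.e. whole 512-byte blocks that read back as nothing but zeroes. This is not the tool the quoted authors used; the function name and sample data are made up for illustration. Zeroed blocks can also be perfectly legitimate (sparse files, padding), so a hit only means something when you compare against a known-good copy.

```python
import io

BLOCK_SIZE = 512  # block size quoted in the post


def find_zeroed_blocks(stream, block_size=BLOCK_SIZE):
    """Return the indices of full blocks that contain only zero bytes."""
    zero_block = bytes(block_size)
    zeroed = []
    index = 0
    while True:
        chunk = stream.read(block_size)
        if len(chunk) < block_size:
            break  # end of data; a trailing partial block is ignored
        if chunk == zero_block:
            zeroed.append(index)
        index += 1
    return zeroed


# Made-up sample: three blocks, the middle one wiped to zeroes.
sample = b"A" * 512 + bytes(512) + b"B" * 512
print(find_zeroed_blocks(io.BytesIO(sample)))  # [1]
```

For a real file you would pass `open(path, "rb")` instead of the in-memory sample.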

The next day not a single server was corrupted, two days silence, then a week. The nightmare was over! At least we thought so… a month after we isolated the problem, a server restarted and came up with corrupted data but only from the small files – including certificates. Even improper shutdown cannot cause this.

Poking around in the source code of the kernel looking for the trim related code, we came to the trim blacklist. This blacklist configures a specific behavior for certain SSD drives and identifies the drives based on the regexp of the model name. Our working SSDs were explicitly allowed full operation of the TRIM but some of the SSDs of our affected manufacturer were limited. Our affected drives did not match any pattern so they were implicitly allowed full operation.
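
To make the "did not match any pattern so they were implicitly allowed" part concrete, here is a toy version of that lookup. It uses Python's glob-style `fnmatchcase` as a stand-in for whatever matching the kernel actually does, and the pattern list and model names are hypothetical, not the kernel's real table.

```python
from fnmatch import fnmatchcase

# HYPOTHETICAL patterns for illustration only, not the kernel's actual list.
LIMITED_TRIM_PATTERNS = [
    "Example SSD 8*",
    "Example MZ7TD*",
]


def trim_policy(model, patterns=LIMITED_TRIM_PATTERNS):
    """Return 'limited' if the model matches a listed pattern, else 'full'."""
    for pattern in patterns:
        if fnmatchcase(model, pattern):
            return "limited"
    # Nothing matched: the drive falls through to the implicit default.
    return "full"


print(trim_policy("Example SSD 840 PRO"))  # limited
print(trim_policy("Example MZ7WD480"))     # full
```

The second call is the failure mode described in the quote: a drive that misbehaves but whose model string matches nothing on the list still gets full TRIM by default.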
...
TL;DR

UPDATE June 16:
A lot of discussions started pointing out that the issue is related to the newly introduced queued TRIM. This is not correct. The TRIM on our drives is un-queued and the issue we have found is not related to the latest changes in the Linux kernel to disable this feature.

Broken SSDs:

SAMSUNG MZ7WD480HCGM-00003
SAMSUNG MZ7GE480HMHP-00003
SAMSUNG MZ7GE240HMGR-00003
Samsung SSD 840 PRO Series
recently blacklisted for 8-series blacklist
Samsung SSD 850 PRO 512GB
recently blacklisted as 850 Pro and later in 8-series blacklist

Working SSDs:

Intel S3500
Intel S3700
Intel S3710

salsa

I'm on the list. This is currently my only pc and my work depends on it. Fuck me hard.


Last edited by Invasor on Wed, 17th Jun 2015 18:16; edited 1 time in total
Invasor
Moderator



Posts: 7638
Location: On the road
Posted: Wed, 17th Jun 2015 01:12    Post subject:
btw, if this shit hits the fan for me it will be my own fault for buying a Samsung product again after I already told myself I wouldn't ever again. :(
Epsilon
Dr. Strangelove



Posts: 9240
Location: War Room
Posted: Wed, 17th Jun 2015 01:13    Post subject:
Odd, been running a Samsung 840 pro for two years now I think. Reboot every evening.
Mounted with discard for trimming. It's never given me issues and I've not seen any kind of data corruption. The SMART tool gives no errors either.
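
For anyone unsure whether their own filesystems are mounted with discard, a small sketch that parses text in the `/proc/mounts` layout (device, mount point, filesystem type, options). The function name and sample lines are made up; on a real Linux box you would feed it the contents of `/proc/mounts`.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mount_point) pairs whose options include 'discard'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "discard" in options.split(","):
            found.append((device, mount_point))
    return found


# Made-up sample in /proc/mounts format.
sample = """\
/dev/sda1 / ext4 rw,relatime,discard 0 0
/dev/sdb1 /data ext4 rw,relatime 0 0
"""
print(mounts_with_discard(sample))  # [('/dev/sda1', '/')]
```

Real usage would be something like `mounts_with_discard(open("/proc/mounts").read())`.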
Invasor
Moderator



Posts: 7638
Location: On the road
Posted: Wed, 17th Jun 2015 01:45    Post subject:
is it ext4?
Epsilon
Dr. Strangelove



Posts: 9240
Location: War Room
Posted: Wed, 17th Jun 2015 02:00    Post subject:
Invasor wrote:
is it ext4?

Yep.
Invasor
Moderator



Posts: 7638
Location: On the road
Posted: Wed, 17th Jun 2015 02:06    Post subject:
maybe these guys just had really bad luck then? :)
Epsilon
Dr. Strangelove



Posts: 9240
Location: War Room
Posted: Wed, 17th Jun 2015 02:14    Post subject:
Invasor wrote:
maybe these guys just had really bad luck then? :)

Could be a hardware combination, like specific chipset and harddrive controller. I don't think it's the ssd.
Invasor
Moderator



Posts: 7638
Location: On the road
Posted: Wed, 17th Jun 2015 14:53    Post subject:
Epsilon wrote:
Invasor wrote:
maybe these guys just had really bad luck then? :)

Could be a hardware combination, like specific chipset and harddrive controller. I don't think it's the ssd.


hmm, wouldn't that cause issues with windows too? I haven't heard about any...
Epsilon
Dr. Strangelove



Posts: 9240
Location: War Room
Posted: Wed, 17th Jun 2015 15:58    Post subject:
Invasor wrote:
Epsilon wrote:
Invasor wrote:
maybe these guys just had really bad luck then? :)

Could be a hardware combination, like specific chipset and harddrive controller. I don't think it's the ssd.


hmm, wouldn't that cause issues with windows too? I haven't heard about any...

Not necessarily.
Shoshomiga




Posts: 2378
Location: Bulgaria
Posted: Wed, 17th Jun 2015 16:24    Post subject: I have left.
I have left.
All times are GMT + 1 Hour
NFOHump.com Forum Index - Operating Systems