More after I figure out Input/Output errors for CIFS shares under Linux.
So it turns out this could have been one of three things. The errors happened after the file had transferred, upon the 'file close' operation, which I assume means writing metadata. The Ubuntu VM had the wrong timezone configured, so either (1) Windows was rejecting the file date because it was in the future, or (2) HD Sentinel, which was also installed on the Windows VM and configured by default to replace the drive icon with one showing a capacity bar graph, conflicted with the file close operation while that icon was being updated, or (3) some kind of NTFS corruption, since response times were in the thousands of milliseconds and this drive is on PCH lanes.
I fixed the timezone, closed HD Sentinel, and selected another folder on another drive for sharing, and there are no more errors. I'll experiment later to pinpoint which of the three it was.
Plotting times seem to have stabilized at between 5:30 and 6:00 per plot with pechy's combined tree chiapos, which is pretty excellent. I have 16 concurrent plots configured with a 30-minute stagger and a maximum of 4 plots in phase 1, but the queue never exceeds 11-12 plots since they finish so quickly. I'll try lowering the stagger to 25 or 20 minutes later and see if that increases the number of concurrent plots.
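That 11-12 figure matches some quick back-of-the-envelope math: with a fixed stagger, the steady-state number of plots in flight is roughly the plot time divided by the stagger interval. A minimal sketch using only the numbers from the paragraph above (nothing here is specific to Swar's plot manager):

```python
# Steady-state concurrency for staggered plotting: a new plot starts every
# `stagger_min` minutes and each plot runs for about `plot_time_min` minutes,
# so roughly plot_time_min / stagger_min plots are in flight at any moment.
plot_time_min = 5.75 * 60   # ~5:45 per plot, midpoint of the 5:30-6:00 range
stagger_min = 30            # one new plot kicked off every 30 minutes

in_flight = plot_time_min / stagger_min
plots_per_day = 24 * 60 / stagger_min   # throughput is set by the stagger, not the plot time

print(f"~{in_flight:.1f} plots in flight, ~{plots_per_day:.0f} plots/day")
# -> ~11.5 plots in flight, ~48 plots/day
```

Dropping the stagger to 25 or 20 minutes pushes the in-flight figure to roughly 14 and 17 respectively, which is when the 16-plot concurrency cap would start to bite.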
The bane of SMR drives is that if and when the CMR cache runs out, writing speeds drop to a positively ancient figure of under 10 MB/s, and it may take several hours for the CMR cache to be flushed out. HDD manufacturers have stated that SMR is here to stay and that it should be countered in software. I'm guessing that means pooling writes between multiple drives, so that each individual drive has time to recover and flush its CMR cache.
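To put that figure in perspective, here's the worst case for a single plot, using the ~101 GB plot size mentioned further down and taking 10 MB/s as the upper bound of the post-cache write speed:

```python
# Worst-case copy time for one plot once an SMR drive's CMR cache is exhausted.
plot_size_gb = 101       # size of a finished plot
write_mb_per_s = 10      # upper bound of the "under 10 MB/s" post-cache speed

seconds = plot_size_gb * 1000 / write_mb_per_s
print(f"~{seconds / 3600:.1f} hours per plot")   # -> ~2.8 hours
```

With a new plot finishing every half hour or so, one destination drive stuck in that state backs the whole queue up very quickly.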
So then it's recommended to have a staging drive, an SSD that buffers all the plots and writes them at its leisure to the mechanical drives. Otherwise the queue may be held up by a plot taking almost an hour to copy. Swar's plot manager does allow you to configure multiple drives to be used in a round-robin fashion (sequentially), but then the number of plots per day suffers if even one of them hits the CMR cache limit, and with multiple SMR drives you could potentially have many destination drives holding up the plotting queue. So an SSD cache is the better option for high-output plotting (versus multiple destination drives). But now we need to get plots off the staging drive and onto our destination drives while avoiding the CMR cache limit of each destination drive.
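The trouble with round-robin is that it picks destinations blindly, in a fixed order. A rough sketch of that behaviour (the paths and helper are made up for illustration; this is not Swar's actual code):

```python
import itertools
import shutil

# Hypothetical destination drives used strictly in rotation.
destinations = ["/mnt/smr1", "/mnt/smr2", "/mnt/smr3"]
rotation = itertools.cycle(destinations)

def copy_plot_round_robin(plot_path: str) -> str:
    dest = next(rotation)             # take the next drive regardless of its state...
    shutil.copy2(plot_path, dest)     # ...so if its CMR cache is exhausted, this copy
    return dest                       # crawls along at <10 MB/s and stalls the queue
```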
MergerFS has policies that unintentionally accomplish this; I'm using the 'most free space' policy. Let's say I have a MergerFS pool of 8 drives with this policy, and they're all empty. When I copy a plot, that 101 GB goes to one of the drives in the pool. Then I wait 30 minutes for the next plot to finish, and MergerFS assigns one of the other 7 drives. And then for the third plot, one of the 6 remaining drives. And so on. So it takes about 4 hours/8 plots before one of the drives is written to a second time. But it could just as well be #8 instead of #1, and if it's #8 then the CMR cache wouldn't have had time to flush and we're back to super slow transfer speeds.
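The effect of the 'most free space' policy is easy to picture with a few lines of selection logic. This is only an illustration of the behaviour, not how mergerfs is implemented, and the paths are made up:

```python
import shutil

# Branches of a hypothetical 8-drive pool; with the 'most free space' create
# policy, a new file lands on whichever branch currently has the most free space.
branches = [f"/mnt/disk{i}" for i in range(1, 9)]

def pick_branch_most_free(branches: list[str]) -> str:
    # After a 101 GB plot lands on a branch, its free space drops, so the next
    # plot naturally goes to a different branch -- but once all branches are
    # even again, the winner among ties is effectively arbitrary (#1 or #8 alike).
    return max(branches, key=lambda b: shutil.disk_usage(b).free)
```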
Budgetary limitations prevent me from investing in another SSD, so I'm using 1.5TB of NVMe storage that's on another VM, in another host that's connected to this host with the plotting VM by 2.5G Ethernet. I've rate limited this connection to 200MB/s for QoS reasons (this other host has 100+ VMs that need some bandwidth). So with this setup, we see network utilization graphs like this:
The green areas are this Staging VM receiving the plots, and the blue areas are the plots being transferred back to the host of the OpenMediaVault VM with MergerFS (which, incidentally, is the same host as the Plotting VMs). So basically, these plots go for a long drive along a scenic route before reaching their permanent home. Also visible in this graph are healthy gaps, which means we haven't hit the CMR cache limit of any of the drives yet.
To monitor the staging drive and automatically transfer plots over, I'm using a very simple PowerShell script that calls upon robocopy. I found this chia2drive.ps1 script over at https://kostya.blog/2021/05/01/full-chia-plotting-automation-in-windows/ and it's been working flawlessly.
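The logic amounts to "sweep the staging folder on a timer and move anything finished over to the pool". A rough Python equivalent of that idea, just to show the shape of it (the real script is PowerShell calling robocopy, the paths here are placeholders, and a real version also has to skip plots that are still being written, which this sketch doesn't do):

```python
import shutil
import time
from pathlib import Path

STAGING = Path("/mnt/staging")   # where finished plots arrive from the plotting VMs
POOL = Path("/mnt/pool")         # MergerFS mount; its create policy picks the branch
INTERVAL = 15 * 60               # sweep every 15 minutes

while True:
    for plot in STAGING.glob("*.plot"):
        print(f"moving {plot.name} -> {POOL}")
        shutil.move(str(plot), str(POOL / plot.name))
    time.sleep(INTERVAL)
```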
I've also started pointing other plotting VMs at this staging drive to streamline the plot copying process, but this has led to some (manageable) traffic congestion:
The green areas are plots coming into the staging drive, and the blue areas are the automatic copying of those plots to the destination drives; the script runs on a 15-minute timer, which explains the gap in the blue.
Overall this setup, while complicated, is working beautifully.