Friday, December 28, 2007
Stanford has told us we must take vacation during the holidays as a cost-saving measure, and yet we must release software. So here we are! We get to gossip about the holidays.
Our Sakai / CourseWork upgrades are, on the whole, pretty easy. More time is taken giving the new deployment a live shakedown. Perhaps Mary Mak can manage to get automated testing in place - it would be great to have robots crawl the site after deployment.
The procedure was that QA did the final build for release, and Julian Morley dropped it onto a private preprod box. There he reviewed any property changes. If a table had changed, Sam would have cut a clone of the production schema beforehand. Julian would apply the Operations lens to the deployment and look for issues.
On release day QA would run any of the tested SQL necessary to prepare / update the databases, and then Operations would manage the servers.
Ops takes the Big5 Load Balancer thingy and puts up the standard "out of service" link, which points to some static HTML in the Stanford www server pool. Then one machine from the tomcat pool is chosen as a candidate. The tomcat from the ops-private preprod box is dumped onto the candidate machine and started up. The non-load-balanced machine name is used for smoke testing.
Cost-cutting measures have led to a loss of that level of software release process. The only real difference is that developers now do the final build & SQL tweaking. This is the double-edged sword of having people around who can tweak the deployment target as it is slammed into production. (When QA ran the show they would rightly punt the deploy if something had a bad smell. Now it can be hacked. You decide how you want to run your institution. :) )
In this case we're rolling out a build containing a Sakai-provisional tool of our own making. It is an alternate "home page" tool intended for use in course sites. Lydia is handling the DB conversion. We're also doing some CourseWork Classic conversion, and Julian has wrangled a blob dump. (We switched to extracted blobs long ago - summer? - but kept them in the DB. The dump now allows us to move the tables and recover the space on disk.)
When we're not doing database table changes lots of the ritual is disposed of. The process boils down to a tomcat drop, a restart and a smoke test.
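That smoke test is the only gate left in the light-weight process, so it is worth being explicit about what "smells OK" means. Here is a minimal sketch, assuming a hypothetical candidate host name and a marker string in the login page; neither is from our actual deploy scripts.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal smoke-test sketch. The host, port, path, and marker string are
// placeholder assumptions; in practice this would point at the
// non-load-balanced machine name for the candidate tomcat.
public class SmokeTest {

    // Pure check: the deployment looks healthy if the page answers 200
    // and the body contains an expected marker string.
    static boolean looksHealthy(int status, String body) {
        return status == 200 && body != null && body.contains("CourseWork");
    }

    // Hits the candidate tomcat directly, bypassing the load balancer.
    static boolean probe(String host, int port, String path) throws Exception {
        URL url = new URL("http", host, port, path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        // (body reading elided; looksHealthy() shows the full idea)
        return conn.getResponseCode() == 200;
    }

    public static void main(String[] args) {
        // e.g. probe("candidate-tomcat.example.edu", 8080, "/portal");
        System.out.println(looksHealthy(200, "CourseWork login"));
    }
}
```

A robot crawler of the sort mentioned above would just be this loop run over a list of known URLs after every drop.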
I'm expecting this to go out without any operational difficulties.
Wednesday, December 19, 2007
To speed up login this past Fall we've taken a few steps. The first was to constrain Realm / Role resolution to the current Term. That was nice, as the Role Resolvers didn't have to crawl across all possible Term/Section combinations for Instructors. But then we pushed it further and removed all calls to refreshUser from the login sequence. ZIP. With no Role checking of any sort our users got in quickly, and our database was not reduced to a heap of smoldering salmon. The user's Site membership was pulled from SAKAI_SITE_USER without an issue, and when the user navigated to a site their roles were pulled. (Sakai recalculates this stuff all the time.)
At the end of the Term we found that a set of users were losing their site membership. It was those users for whom Support had performed a "Become User" operation while resolving various issues. The Become User operation was using the Term-aware changes we had added and stripping the User from all the past-term Sites in which they were enrolled. Since the Term dates (see the jira about 'Term effective dates') had officially passed, the Sites for that Term were being stripped of those users whose Support issues had warranted a Become User session.
Pretty darn confusing.
Especially as each CM Sync job, to work around other bugs in Sakai, was doing a Site Save and a toggle of the Section Administration settings to keep the Realm info up to date. These pushed the users back INTO the sites they belonged to. Depending on when the user logged in, they may or may not have had access to their previous quarter's Sites.
Here is my log from the internal Sakai confluence site:
Working steps. done in my dev instance, against my dev oracle database.
- bring up Stan2.4.x_D
- login in as self
- review test site Su07-xoxo-001-01
- review Su07 term - it's not active. but my instructor membership in the site shows the site on my pulldown (the tab area is full)
this gives me a list of folks. I am the instructor and cwrks223 is a student.
I move to another computer and log in as admin/admin.
- login as admin/admin
- test cwrks223 account via SU 'LDAP Peek' to see if account looks OK. it does.
- SU to cwrks223
At this point the Site is not listed.
I move to the first computer
- on first computer I log out of caseyd1 session
- log in as cwrks223
- review tabs: no Su07-XOXO-002 listed
what's wrong with this? I didn't log in as cwrks223 FIRST. doh. notice that there was no syncing going on at this point.
- on second computer I log in as caseyd1
- I review membership in Su07-XOXO-001-01. cwrks223 is missing.
- using SQL Developer I examine database tables:
- user is still in CM data
- user is not in SAKAI_SITE_USER for this site
- user is not in REALMS for this site
select * from sakai_realm
where realm_key in
(select realm_key from sakai_realm_rl_gr where user_id = 'cwrks223')
-- works for this user in my DB due to sakai_id == EID for user
This confirms that SU strips the Sakai site associations for out-of-term sites.
- on the second computer I run Site Sync for Su07-XOXO
- on the first computer ( cwrks223 account ) I nav to a tab
- on the first computer my site membership returns
This confirms that Sync Site, when it refreshes the Site against the updated CM data, restores the user to the various Sakai side records.
- in SQL developer I repeat the above query
- the user is now in the top level Realm for the site
- the user is now in the proper section Realm for the site.
- on the first computer I navigate to the Membership tool. cwrks223 membership now shows the proper list.
to test "Update Participants" refreshing
- on second computer SU to cwrks223
- on second computer cwrks223 confirm that I don't see the Su07-XOXO site
- on first computer cwrks223 tab-nav to a site, and lose Su07-XOXO membership
- on second computer logout cwrks223 and login caseyd1
- on second computer caseyd1 tab-nav to Su07-XOXO and Update Participants.
- on second computer caseyd1 see cwrks223 return to membership list.
- on first computer cwrks223 tab-navs to a site, and regains Su07-XOXO membership
this confirms that an admin or instructor using Update Participants for a site restores Sakai-lost members by refreshing against unchanged CM data.
Dialing in on the problem leaves us with Become User as the starting point for investigation. Our Term-based optimizations are being used by the Sakai Kernel to remove prior-term Site memberships for those users who have been "Become Usered" by our Support team. Well, that sucks.
A patch is to remove the refreshUser call from Become User. The privs are actually calculated when the user navs to the Site in question. think 'lazy loading.' :)
This is the second place where we've removed the call. The first is in the login sequence. I think that Sakai is resolving User membership rights just fine when the user navs to the site.
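The lazy-loading idea behind both removals can be sketched like this. This is a toy illustration, not Sakai's real API; the class and method names are all hypothetical. The point is that role resolution runs only when a user first navigates to a site, never during login or Become User.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model of lazy role resolution (none of these types are Sakai's).
// Login and Become User never call rolesFor(); only site navigation does.
public class LazyRoles {
    private final Map<String, Set<String>> cache = new HashMap<>();
    private final Function<String, Set<String>> resolver;
    int resolverCalls = 0; // exposed so the demo below can count resolutions

    LazyRoles(Function<String, Set<String>> resolver) {
        this.resolver = resolver;
    }

    // Called on site navigation; resolves once, then serves from cache.
    Set<String> rolesFor(String siteId) {
        return cache.computeIfAbsent(siteId, id -> {
            resolverCalls++;
            return resolver.apply(id);
        });
    }
}
```

With this shape there is nothing for a Term-scoped refreshUser to strip at login or SU time, because nothing is materialized until the user actually needs it.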
I think that the 2.1 era addition of the SAKAI_SITE_USER table allows us to drop a lot of the confusing Realm scrubbing. It just wasn't reviewed at the time.
Tuesday, December 18, 2007
One of the hard parts of working with our QA crew is getting a common model of how Sakai operates to inform their efforts. So much seems magical / inconsistent / random that they often throw up their hands in the bug reports and trail off into "what did we miss?" and "this happens now and again..."
The practice, with stable code, should be deterministic.
One source of the problem is basing QA on clones of production databases and using the production integration points. This practice comes from laziness. I can understand it, as QA is in reality the second tier of support at Stanford. (There is only one support person for our few thousand users. Go cost-cutting measures!) One reason QA uses live data is that they are so often pulled in to examine production issues.
As a result the development of formal test cases has been seen as a luxury.
The production of formal test cases would follow an understanding of the Sakai models - and internalizing that is just not going to happen in such an environment. One would need a couple more QA people focusing solely on QA, and not "where is my worksite? oh under that pulldown?" issues.
The construction is astounding. The weekend traffic was boggling. The snow was nice; spent an afternoon tubing with Child while younger kids sledded and M1 snowshoed with Mr Cox, singlespeedjane, JoeB, and TonkaFastButt. JimmyD and Kate oscillated back and forth.
Conversation surrounded changes in our lives; it seems that a lot of things are up in the air.
Tuesday, December 11, 2007
As I didn't (still don't) have QA LDAP servers to work with, I had to hope that it worked. In retrospect I could have played with my DNS settings to make some fail to resolve...
During a scheduled outage of an LDAP server today (during finals week? I don't get it.), it seems that tearing down the entire secure connection context isn't enough to avoid the sticky-IP problem. As a result, post-Kerberos AuthN work against the LDAP pool wasn't working correctly for a good number of users, and they couldn't completely log into Sakai. We're not sure if the TTL for the DNS entry was set low enough before the outage, but assuming it was, I expected the new from-the-bottom connection to do a full DNS request and get only the good IPs.
It looks like we'll have to set a couple of JVM properties, networkaddress.cache.ttl being the first one. The default is for the JVM to cache successful DNS resolutions forever. So even if the JAAS context was built up each time, the JVM would still have been giving it the wrong machine to work with. Ugh.
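For reference, these are Security properties rather than ordinary system properties, so they need to be set early (before the first lookup) or in the JVM's java.security file. A minimal sketch; the 60/10 second values are arbitrary assumptions, not tuned recommendations:

```java
import java.security.Security;

// Cap the JVM's DNS caches so a failed-over LDAP IP eventually drops out.
// These must take effect before the first InetAddress lookup, either set
// here at startup or in $JAVA_HOME/lib/security/java.security.
public class DnsCacheConfig {
    public static void main(String[] args) {
        // cache successful lookups for 60 seconds instead of forever
        Security.setProperty("networkaddress.cache.ttl", "60");
        // don't cling to failed lookups for long either
        Security.setProperty("networkaddress.cache.negative.ttl", "10");

        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```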
Some Stanford teams actually roll through IP #s for LDAP connections, recovering from bad contexts by sidestepping the DNS loadbalancer altogether.
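That roll-through-the-IPs approach is easy to sketch: resolve the pool name fresh on each connection attempt and try every returned address in order, instead of trusting whichever single IP the JVM happened to cache. The pool name below is a stand-in, and the actual LDAP bind is only indicated in a comment:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch of sidestepping the DNS load balancer: get ALL the A records
// behind the pool name and iterate, rather than using one cached IP.
public class LdapHostRoller {

    static InetAddress[] candidates(String poolName) throws UnknownHostException {
        // getAllByName returns every address behind the name, not just one
        return InetAddress.getAllByName(poolName);
    }

    public static void main(String[] args) throws Exception {
        // "localhost" stands in for the real LDAP pool name here
        for (InetAddress addr : candidates("localhost")) {
            System.out.println(addr.getHostAddress());
            // a real client would attempt the LDAP bind against addr here,
            // falling through to the next address on connect failure
        }
    }
}
```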
Friday, December 7, 2007
In itself this is 'fine', but it has the result of leaving all the Section Aware Tool ACLs, stuffed away in sad little XML packages hither and yon, referring to a non-existent Group. Sadness can result, as changes to the membership are not reflected in the Tool's ACL.
So some folks who should be able to get their stuff can't, and others who should no longer have access to stuff may still get it.
I am told that when all the XML is dissolved into columns this problem "may" go away. Although I doubt it: some tool-wide notification mechanism will be necessary. (That would be fun to join in on, perhaps in my upcoming free time when I'm no longer employed by Stanford.) And I need a 2.4.x-era fix.
I'm going to try the old Switcheroo in SectionManagerImpl: let it create the new group thingy but then rename it to the old, deleted thingy's name. Evil? Yes. If that works it's an easy hack we can get into our pre-break release.
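The shape of that hack, in a toy model: every name here is hypothetical (this is not the real SectionManagerImpl API). The trick is to let the manager create its fresh group and then rename it to the deleted group's id, so the stale tool ACLs, which still point at the old id, resolve again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the Switcheroo. Group and the registry map are stand-ins
// for whatever Sakai actually keeps; only the rename trick is the point.
public class GroupSwitcheroo {
    static class Group {
        String id;
        Group(String id) { this.id = id; }
    }

    static Map<String, Group> groups = new HashMap<>();

    // Create the replacement group, then give it the dead group's id so
    // XML ACLs still referencing the old id find a live Group again.
    static Group createThenRename(String newId, String oldDeletedId) {
        Group g = new Group(newId);   // what the manager would create
        g.id = oldDeletedId;          // the evil rename
        groups.put(g.id, g);
        return g;
    }
}
```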
Then I can look at a real solution.
Wednesday, December 5, 2007
The blog posting, http://bfish.xaedalus.net/?p=239 , works fine - but expect some oddness as this is all early release stuff.
I haven't looked at the Sakai Schedule / Calendar tool, but a Lightning Provider for Sakai would be sweet.
More information on Lightning can be found at the Mozilla Wiki's Calendar area.
Tuesday, December 4, 2007
I, however, am chasing Production and QA issues :P
The Cloaking Device is in place for our load testing. It's been laid over a snapshot of Production... now we have thousands of 'fake people' to allow our vendors and consultants to work with.
However we have a problem with our 2.4.x_D1 deployments - Lydia's is not working properly and mine is. After I fix production issues I'll be back looking at our svn:externals to see if the current QA tag is pulling something different.
The symptom is that her deployment is not showing the cloaked users. Mine does. Weird.